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Abstract — Inference problems in graphical models can be rep- 
resented as a constrained optimization of a free energy function. 
In this paper we treat both forms of probabilistic inference, 
estimating marginal probabilities of the joint distribution and 
finding the most probable assignment, through a unified message- 
passing algorithm architecture. In particular we generalize the 
Belief Propagation (BP) algorithms of sum-product and max- 
product and tree-rewaighted (TRW) sum and max product algo- 
rithms (TRBP) and introduce a new set of convergent algorithms 
based on "convex-free-energy" and Linear-Programming (LP) 
relaxation as a zero-temprature of a convex-free-energy. The main 
idea of this work arises from taking a general perspective on the 
existing BP and TRBP algorithms while observing that they all 
are reductions from the basic optimization formula of f + ^ihi 
where the function / is an extended-valued, strictly convex but 
non-smooth and the functions hi are extended-valued functions 
(not necessarily convex). We use tools from convex duality to 
present the "primal-dual ascent" algorithm which is an extension 
of the Bregman successive projection scheme and is designed to 
handle optimization of the general type / + fti. We then map 
the fractional-free-energy variational principle for approximate 
inference onto the optimization formula above and introduce 
the "norm-product" message-passing algorithm. Special cases of 
the norm-product include sum-product and max-product (BP 
algorithms), TRBP and NMPLP algorithms. When the fractional- 
free-energy is set to be convex (convex-free-energy) the norm- 
product is globally convergent for the estimation of marginal 
probabilities and for approximating the LP-relaxation. We also 
introduce another branch of the norm-product which arises 
as the "zero-temerature" of the convex-free-energy which we 
refer to as the "convex-max-product". The convex-max-product 
is convergent (unlike max-product) and aims at solving the LP- 
relaxation. 

Index Tenns — Approximate inference, Bethe free energy, Breg- 
man projection, convex free energy, dual block ascent, Fenchel 
duality, graphical models, linear programming (LP) relaxation, 
Markov random fields (MRF), maximum a posteriori proba- 
bility (MAP) estimation, max-product algorithm, sum-product 
algorithm. 



I. Introduction 

PROBABISITIC graphical models present a convenient 
and popular tool for reasoning about complex distri- 
butions. The graphical model reflects the way the complex 
distribution p(a;i, x„) factors into a product of potential 
functions, each defined over a small number of variables, 
and referred to as factors. A graphical model, which defined 
in terms of factor graphs, represents the incidence between 
factors and the variables by a bipartite graph with one set of 
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nodes corresponding to the variables of the joint distribution 
and another set of nodes standing for the factors. An edge 
exists between a variable node and a factor node if the variable 
is contained in the set of variables represented by the factor. 
In many applications of interest the factor graph is sparse. In 
other words, in the modeling of the joint behavior of a set of 
interacting variables it is often the case that only a small subset 
of variables interact directly. For example, in the domain of 
image processing, if we think of each pixel as a variable in 
a joint distribution over all image pixels then, typically the 
intensity value of a single pixel will depend most strongly 
on neighboring pixels in the image, rather than on those at a 
distant location. Without the local interaction assumption, i.e., 
if each variable interacts directly with all other variables, then 
the inference of the joint behavior would be a hopeless task. 

Problems involving inference using graphical models comes 
up in a wide range of applications covering a variety of disci- 
plines. Those include digital communications (error correcting 
codes ifTSl ). computer vision 1551 . medical diagnosis |25|, 
protein folding fS8|, computer graphics fTT], |9|, clustering 
1491 . as well as other broad disciplines which include signal 
processing, artificial intelligence and statistical physics ifTSl . 

Probabilistic inference comes in two distinct forms and 
typically involve two slightly different algorithmic thrusts. One 
form of inference task is to obtain one global state of the 
joint distribution that is most probable, i.e., find the values 
of xi,...,Xn which maximizes p(2;i, a;„). This form of 
inference is typically referred to as the maximal a-posteriori 
assignment, or in its abbreviated form, the MAP assignment. 
The second type of inference has the objective of obtaining 
marginal probabilities for some subset of variables given 
evidence (value of) about other variables. For example, if 
Xi E {!,..., n^} then p{xi) comes out of summing exponen- 
tiafly many elements J2{xi x„}\xi Pi^^' ■■■T^n) resulting in 
the likelihood of Xi to obtain each of its possible rii values. 
In this paper, we will focus on both inference problems with 
the objective of introducing a unifying algorithmic thrust. 

Exact inference is NP-hard ll50ll . thus introducing the need 
to derive algorithms for approximate inference. One of the 
most popular class of methods for inference over (factor) 
graphs are message-passing algorithms which pass messages 
along the edges of the factor graph until convergence is 
reached. The belief-propagation (BP) algorithms Ii44i come 
in two variations: the sum-product algorithm for computing 
marginal probabilities and the max-product algorithm for 
computing the MAP assignment. Citing [69 1, the centrality 
of inference using graphical models and the utility of the 
BP algorithms for solving them is reflected in the fact that 
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equivalent or very similar message-passing algorithms have 
been independently derived under different disciplines. Those 
include the Viterbi algorithm [59J, Gallager's sum-product 
algorithm for decoding low-density parity check codes lfT6l . 
the turbo-decoding algorithm f3\, the Kalman filter for signal 
processing |28|, and the transfer-matrix approach in statistical 
mechanics [IJ. 

The BP algorithms are exact, i.e., the resulting marginal 
probabilities and the MAP assignments are the correct ones, 
when the factor graph is free of cycles — a state of affairs that 
considerably limits the application of those algorithms to solve 
real world problems. Nevertheless, an intriguing feature of BP, 
which most likely is the source for its great popularity, is that it 
is well-defined and often gives surprisingly good approximate 
results for graphical models with cycles. However, in this 
context there are no convergence guarantees (except under 
some special cases 1361, pT]) and the algorithms fail to 
converge in many cases of interest. 

During the past decade there has been much progress 
in putting forward a framework for approximate inference 
using variational principles. It has been shown that the fixed- 
points of the sum-product algorithm (for estimating marginal 
probabilities) correspond to the fixed-points of a constrained 
energy function called the Bethe free energy |69|. The free 
energy arises from the expansion of the KL-divergence be- 
tween the input distribution and its product form. The Bethe 
approximation replaces the entropy term in the free energy by 
the Bethe entropy. The investigation of the stationary points 
of the Bethe free energy yields conditions for convergence 
of BP [20], and lower bounds for the free energy in some 
special cases ll54l . These lower bounds are based on the loop 
calculus framework which considers the Bethe free energy 
as a first order approximation for the free energy |8|. The 
Bethe free energy is exact for factor graphs without cycles, 
as well as convex over the set of constraints (representing 
validity of marginals). When the factor graph has cycles the 
Bethe energy is non-convex and the BP algorithms may fail 
to converge. Although it is possible to derive convergent 
algorithms to a local minima of the Bethe function llTOl . 
II22I the computational cost is large and thus has not gained 
popularity. 

To overcome the difficulty with the non-convexity of the 
Bethe approximation, several authors have introduced a class 
of approximations known as convex free energies which are 
convex over the set of constraints for any factor graph. An 
important member of this class is the tree-reweighted (TRW) 
free energy which consists of a linear combination of free 
energies defined on spanning trees of the factor graph |61 ]. It is 
notable that for this specific member of convex free energies a 
convergent message-passing algorithm, applicable to pairwise 
factors only, has been recently introduced |17|. However, a 
convergent message passing algorithm for the general class of 
convex free energies is still lacking. The existing algorithms 
either employ damping heuristics to ensure convergence in 
practice ||62| or focus on a sub-class of free energies where 
the entropy term is a positive combination of joint entropies 

nm. 

The MAP assignment problem has been shown to be ap- 



proximated by a Linear-Programming (LP) relaxation scheme 
||63]| with message-passing algorithmic attempts as a solution 
1311 . 163, QH, 1661, Ell. Some of these attempts guarantee 
convergence only under special cases (such as binary vari- 
ables), pTI, f65l. Others, such as ITSl . arises as a special 
case of our algorithm. We refer to l37]| for detailed account on 
the connections between these message-passing algorithms. A 
double-loop of message passing using a proximal minimiza- 
tion technique proposed recently by [45 j is convergent but 
at a considerable computational expense. Dual decomposition 
techniques were recently proposed ll30l . l33l, which are related 
to dual subgradient methods for the LP relaxation. 

In this paper, we derive a class of approximate infer- 
ence message-passing algorithms, which we call norm-product 
algorithms, using the notion of free-energy approximation. 
The norm-product is an inference engine covering both the 
estimation of marginal probabilities and the MAP assignment. 
When the Bethe free energy is used as a substitution for 
the free-energy, the norm-product reduces to the sum-product 
and max-product algorithms where the latter emerges as a 
"zero temperature" version of the former. When a convex-free- 
energy is used the norm-product becomes a convergent family 
of algorithms along three strains: (i) a globally convergent 
algorithm, which we call convex-sum-product, for estimating 
marginal probabilities, (ii) a locally convergent algorithm 
emerging as a zero-temperature version of the former strain, 
we call convex-max-product, for estimating the MAP assign- 
ment, and (iii) a globally convergent algorithm for the LP- 
relaxation problem. 

The convex-sum-product algorithm was published in |fT9ll 
with only a brief sketch of the detailed derivation. In this 
paper we have chosen to put a large amount of material in 
appendices. Due to the complexity of the presented material 
and the extensive use of modem optimization infrastructure, 
the body of the paper contains the main "storyline", statements 
and algorithms whereas the detailed proofs and the required 
mathematical infrastructure are contained in appendices. 

II. Notations, Problem Setup and Background 

Let xi,...,Xn be the realizations of n discrete random 
variables where the range of the i'th random variable is 
{!,..., rii}, i.e., Xi £ {!,...,«;}. We consider a joint distri- 
bution p{xi, Xn) and assume that it factors into a product 
of non-negative functions (potentials): 

p{xi,...,Xn) = V^qIXq), (1) 

i=l a=l 

where the functions 4>i{xi) represent "local evidence" or prior 
data on the states of Xi, and the functions tpai^a) have argu- 
ments Xq that are some subset of {xi, and Z is a nor- 
malization constant, typically referred as the partition function. 
For example, p{xi,X2,X3) = {l/Z)ip23ix2,X3)ilJi3{xi,X3) 
has two factors with indices ai — {2,3},a2 = {1,3} and 
= {x2,X3},Xa2 = {xi,X3}, and uniform local evidence 
4>i{xi) = 1 for every i = 1,2,3 and every Xi. 

The factorization structure above defines a hypergraph 
whose nodes represent the n random variables and the subsets 
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of variables correspond to its hyperedges. For example, if 
all factor functions are defined on pairs of random variables 
then the factorization is represented by a graph. A convenient 
way to represent hypergraphs is by a bipartite graph with 
one set of nodes corresponding to the original nodes of the 
hypergraph and the other set corresponds to its hyperedges. 
In the context of graphical models such a bipartite graph 
representation is referred to as a factor graph |35| with 
variable nodes representing (t>i{xi) and a factor node for each 
function V'a(xa). An edge connects a variable node i with 
factor node a if and only if xi E Xq,, i.e., Xi is an argument 
of ijja- We adopt the terminology where N{i) stands for all 
factor nodes that are neighbors of variable node i, i.e., all the 
nodes a for which € x^, and N{a) stands for all variable 
nodes that are neighbors of factor node a. 

We shall focus on the two inference tasks of computing 
marginal probabilities and maximum a-priori (MAP) assign- 
ment. The computation of the marginal probabilities p{xi) ~ 

J2x.\x,Pi^) ^^'^ Pi^a) = Z]x\x„P(^)' requires the summa- 
tion over the states of all the variable nodes not in Xi or 
Xq, respectively. This computation is generally hard because it 
may require summing up exponentially large number of terms 
— thus one seeks efficient ways or approximate solutions for 
the marginals. The MAP assignment is the task of finding a 
state for each Xi that brings the maximal value to the joint 
probability p{xi, a;„). 

The belief-propagation (BP) algorithms, known as sum- 
product and max-product, are two algorithms for computing 
marginal probability and MAP assignment, respectively, that 
can be described in terms of operations on a factor graph. As 
already mentioned in the introduction, the BP algorithms will 
deliver the correct inference, i.e. are exact, if the factor graph 
has no cycles, but are still well defined and often provide good 
approximate results when the factor graph has cycles. 

The BP algorithms are defined in terms of messages be- 
tween variable and factor nodes. The message ma^i{xi) from 
factor node a to variable node i, and the opposite direction 
message ni^a{xi), is a vector over the states of Xi. In the 
sum-product algorithm those have the following form: 

ma^i[Xi) = ^ V'a(Xa) J]^ Tlj^aiXj) 

Xo.\xi j^N{a.)\i 

(t>i{xi) mp^i{xi) 

PeN{'i)\a 

The cx indicates that one can normalize the vector The 
messages ni^a{xi) are usually initialized to the uniform 
vector Upon convergence of the message-passing scheme the 
marginal probabilities p{xi) and p(xq) can be expressed in 
terms of a "pseudo-distribution", also known as beliefs, hi{xi) 
and 6q,(xq,) defined below: 

bi{Xi) cx cj)i{Xi) ITla^^ixi) 
aeN{i) 

"^j-^a {Xj ) 

jeN{a) 

When the factor graph has no cycles the messages converge 
and the beliefs correspond to the marginal probabihties. When 



the factor graph has cycles there is no convergence guarantee 
and, regardless of convergence, the recovered beliefs provide 
only an approximation to the marginal probabilities. 

In the max-product algorithm the messages ma^i{xi) are 
slightly altered: 

ma^i{Xi) = max < V'a(Xa) TT nj^a(Xj) > , 

I I 

L ieA'(Q)\j j 

while ni-^a{xi) remain as in the sum-product algorithm. The 
MAP assignment can be recovered from the beliefs bi{xi) 
when the factor graph is a tree. In such a case, the MAP 
assignment of Xi corresponds to the index of highest entry of 
bi{xi). In general convergence is not guaranteed, and the MAP 
assignment can be recovered only for specific problems, Ii64i . 

na, m, im. 

A. Inference using a Variational Principle 

The BP algorithms apply to tree-structured factor graphs 
yet are well defined for general factor graphs but without 
convergence or accuracy guarantees. The variational principle 
approach, described below, is a decade long effort at providing 
an extended platform from which old, i.e., BP algorithms, and 
new (preferably convergent) algorithms can emerge. 

The variational approach seeks a distribution p{xi, ...,Xn) 
that is as close as possible, in relative entropy terms, to the 
product potentials (pi{xi) and ipa{^a)- Expanding the KL- 
divergence D{p \\ q) = J2x Pi-"") ^''^(Pi^) / ^i^)) between two 
nonnegative vectors results in: 

^(p II n'^^n^")=^(p)' 

i a 

where F{p) is the so called Gibbs-Helmholtz free-energy: 

^(P) ^ S^ix^)p{xi) + 6l„(x„)p(x„) - i7(p) 

= E{p)-H{p) (2) 

The term i?(p) = — X]xP(-'^) liip(x) is the entropy and 9i = 
— h\(f)i and Oa — — In^Q. The linear term i?(p) is often 
referred to as the energy term. By minimizing F{p) over the 
probability simplex V — {p : p > 0, = 1} 

get back the probability distribution defined in eqn. [T] as the 
optimal argument p* — argminpgpF(p), and minus the log 
of the normalization, or equivalently the partition function, as 
the value at the minimum: 

n m 

V* = -\nZ = F{p*). 

i—l a—1 

Since F{p) is strictly convex and the simplex constraints are 
convex, the minimum is unique. So far we have not gained 
anything because the entropy H{p) is computationally in- 
tractable since its evaluation is exponential in n, and satisfying 
the probability simplex constraints is intractable as well. The 
variational methods are based on a tractable approximation to 
the free-energy F{p) by (i) approximating the entropy term 
H{p) by a combination of local entropies over marginal proba- 
bilities p{xi),p{xa), and (ii) by approximating the probability 
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simplex constraints by the so called "marginal consistency" 
constraints. 

In approximate inference, the true marginal distributions 
p{xi) and p(xq) are replaced by "beliefs" bi{xi) and 6q,(xq,) 
which form a "pseudo distribution" in the sense that the beliefs 
might not necessarily arise as marginals of some distribution 
p{xi, ...,Xn)- The probability simplex constraints are replaced 
by marginal consistency constraints h{G) defined below: 

f V 6„(x<,) = 6i(a::0 Vi,QeiV(i) 

L(G)= <^b = {b„b„}: 

The entropy approximation H{b) as a function of the beliefs 
is known as fractional entropy and has the form: 

^c„i?(b„)+^c,i7(b,), (3) 

a i 

where the joint entropy H{ha) = ~ X)x ^a(^a) In^al^a) 
and the local entropy Hihi) — — b{xi)h\b{xi). 

For factor-graphs without cycles, the setting of Cq = 1 
and Ci = 1 — di where di is the degree of the variable 
node associated with Xi in the factor graph, renders the 
approximation H to be exact and equajj to the entropy H. 
Such an approximation is known as the Bethe entropy: 

a i 

Moreover, in the case of a tree, the marginal consistency con- 
straints L(G) are equal to the probability simplex constraints, 
thus making the constrained Bethe free energy problem 

min ei{Xi)bi{Xa) + y^9a{^a)ba{^a) - Hbethe (b) , 

a convex optimization producing the true marginals bi{xi) — 
p{xi) and 6c(xq) ~ p(xq). The constrained optimization is 
defined in terms of beliefs only and is therefore computation- 
ally tractable. However, if the factor graph has cycles, the 
minimizer of the constrained Bethe free energy is not guaran- 
teed to correspond to the true marginals p{xi), p(xq,), and not 
even realizable as a true distribution. Therefore, for general 
factor graphs, the Bethe free energy optimization approach 
finds an approximation to the true marginal probabilities. From 
the optimization point of view, the Bethe free energy is strictly 
convex in the intersection of constraints when the factor graph 
is a tree. When the factor graph has cycles the Bethe energy 
is non-convex and although it is possible to derive convergent 
algorithms to local minima of the Bethe function llTOl , ll22l the 
computational cost is large and thus has not gained popularity. 

What makes the Bethe free energy optimization interesting 
is the observation, first elucidated by 1691 , that when the sum- 
product algorithm converges then it does so to a stationary 
point of the constrained Bethe free energy, i.e., fixed-points of 
the algorithm correspond to stationary points of the variational 
problem. This does not mean that the sum-product algorithm 

'in this case the joint probability can be expressed solely in terms of the 
marginals: p{x\,...,Xn) = YlaVi^a) / YliPi^i)'''^'^ ■ Expanding H(j)) 
produces the Bethe entropy approximation. 



descends on the Bethe free energy (in fact it does not), but that 
near a fixed point things start to behave well. The significance 
of the observation is that it ties the popular sum-product 
algorithm with a specific variational principle and moreover 
it suggests a framework for seeking natural generalizations 
of the Bethe approximation with their associated message- 
passing algorithms. 

Generalizations of the Bethe free energy move along two 
thrusts. The first employs better (higher-order) approxima- 
tions to the entropy and higher-order constraints beyond the 
marginal consistency constraints to better approximate the full 
probability simplex constraints. This effort includes Kikuchi 
free energy, region graphs and other hyper-graph based meth- 
ods 1691 . l29l . The second thrust looks for convergence guar- 
anteed message-passing algorithms by extending the Bethe 
free energy to form a wider class of functions, known as 
convex free energies, which are convex in the intersection of 
marginal consistency constraints. In this paper we focus on the 
second thrust. The inclusion of Kikuchi approximations and 
region graphs is a natural extension to the results we introduce 
in this paper but for the sake of clarity we leave it outside the 
current scope. 

The fractional entropy eqn.|3]can be set to form a family of 
concave approximations. The set of sufficient conditions for 
an entropy approximation of the type of eqn. [3] to be concave 
over the set of constraints was introduced in 1211 . ||65]| and 
take the following form: 

Definition 1 (Concave Entropy Approximation): An 
approximate entropy term of the form eqn.[3]is strictly concave 
over the set of marginal consistency constraints if there exists 

Ci,C,a > and Ca > such that Ca = Ca + Y.ieN{a) ^ia 

and Ci = - I]agAr(i) ^ia- The approximate entropy 
becomes: 

^c,i/(b„)+^c.//(bO+ c,c(i/(b,)-//(bO). (4) 

The entropy approximation Hih) includes the Bethe approx- 
imation when Ca — l-.Ci — I — di and Cia = but it is 
guaranteed to be strictly concave for any setting of the param- 
eters where Ci , Cia > and Cq, > 0. An important member 
of this class is the "tree-reweighted" (TRW) approximation 
[62] where is equal to a weighted combination of spanning 
trees of the original graph (all factors are pairwise and thus a 
represents an edge) which pass through a. In Appendix [P] we 
describe a number of concave settings of H including TRW 
and other heuristic settings. The convex-free-energy variational 
program becomes: 

min y^bi{xi)ei{xi) + ea{xa)ba{xa) - H{h). (5) 

The global minimizer b* of the convex-free-energy program 
above is an approximation to the marginal probabilities due 
to (i) the term H is an approximation to the entropy of the 
distribution and its quality depends on how the parameters 
Cq, Ci, Cia are set and on the structure of the factor graph, and 
(ii) due to the fact that the marginal consistency constraints 
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L(G) approximate the probability simplex constraints, there 
is no guarantee that in general b* form a distribution, i.e., the 
marginal estimations &* (x^) and b*{xi) might not arise from 
any probability distribution over xi, ...,Xn- 

The only guarantees we have is that if the factor graph has 
no cycles then the marginal probabilities are exact and if H 
is strictly concave then it should be possible to generate a 
convergent message-passing algorithm (unlike BP algorithms 
which are not generally convergent). 

We move next to the MAP assignment task where one seeks 
a vector x* which maximizes the product of potentials, or 
equivalently minimizes the energy 

argmax (?!>i ) (^a ) = argmin ^ 6*1 (x^ ) + ^ (a:^ ) . 

X . X . 

la 1 a 

Described as a variational principle program, the MAP 
assignment problem is equivalent to the linear program whose 
variables corresponds to distribution p{xi, ...,Xn) with expo- 
nential many elements: 

The optimization of a linear function over the probability 
simplex yields an optimal solution p* in an extreme point of 
the probability simplex, namely p* is a zero-one distribution. 
In particular p*(x*) 1 and for every x 7^ x* holds 
p*(x) =0. 

An approximation can be obtained by approximating the 
marginal probabilities p{xi) and J3(xq) with beliefs bi{xi) 
and 6q(xq) which are not guaranteed to correspond to a true 
distribution over xi, ...,x„. 

min y^9i{xi)bi{x^) + 9a{:s.a)bai:>Ca) (6) 

If the minimizer of the LP -relaxation problem comes out 
without ties, i.e., the marginal vectors bi{xi) have a single 
maximal entry, then the MAP assignment readily emerges 
from the LP-relaxed solution. This LP-relaxed problem can 
be solved using off-the-shelf LP solvers but the key problem 
with standard LP solvers, however, is that they do not use the 
graph structure explicitly and thus are sub-optimal in terms 
of computational efficiency. An empirical study found the 
message-passing LP-solvers, e.g. max-TRBP, to be superior to 
the CPLEX solver, a commercial LP solver that implements 
different techniques for solving LP, such as primal and dual 
simplex solvers, network solvers, primal-dual barrier solver for 
sparse problem, and sifting techniques executing sequences of 
LP subproblems |67|. 

The relaxed LP problem of eqn. |6] has been widely studied 
in the literature in the context of message-passing algorithms. 
Special cases of these LP-relaxations were used for constraints 
satisfaction f34^]. The general form in eqn. |6]was studied 
using tree decompositions in ||63]| . 130^, as well as dual de- 
composition ||32]| . $33l . and dual block coordinate ascent Il66l . 
|[T8 1, |52|. A general framework for these recent developments 
is described in ll37ll . Since the LP energy is not strictly convex, 
convergence to the global minimum is a challenge, since eqn. 



|6] usually corresponds to a non-smooth dual. In this case a 
dual block coordinate ascent can lead to a corner in the dual 
objective, which is a non-optimal stationary point. 

An alternative class of methods are based on a (strictly) 
convex relaxation approach. There are two notable recent 
examples in this class: one using a proximal minimization 
technique where the convex term is a weighted KL-divergence 
measure between the sought-after belief vector and the one 
from the previous iteration [45 1. The proximal minimization 
approach involves a double-loop of message passing iterations 
and is guaranteed to converge to the global optimum of eqn. [6] 
The second approach, the one we follow in this paper, is to 
make eqn. |6]the "zero temperature" of the perturbed problem: 

min ei{xi)bi{xi) + 9a{xa)ba{xa) - £H{h), (7) 

by taking e — > 0. This approach was used in decoding 
low -density parity-check codes 1601 . It was also used for 
LP-relaxations, to derive a non-convergent max-product like 
algorithm |65|, and for applying an iterative proportional 
fitting type algorithm ||26| . 

This concludes the necessary background to inference 
within the framework of variational principle. The variational 
problem we will work on next is eqn. [7] We will derive a con- 
vergent message-passing algorithm called the norm-product. 
When the parameters of H are set to the Bethe approximation 
the algorithm reduces to the sum-product (when e = 1) or the 
max-product (when e — 0). When H is concave and e = 1 
the norm-product becomes a globally convergent message- 
passing algorithm, referred to as convex-sum-product, for 
approximating marginal probabilities. When e = we obtain a 
convergent form of max-product we call convex-max-product 
and when e — > we obtain an approximation (with proven 
bounds) to the LP-relaxation solution. 

III. The Norm-Product Belief Propagation 
Algorithm 

We seek an algorithm for minimizing the inference varia- 
tional eqn. |7] with the following properties: (i) if the entropy 
approximation term H is strictly concave, i.e., eqn. [t] is 
a convex-free-energy, the algorithm will be convergent for 
all e > and will converge to the global optimum when 
e > 0, (ii) the algorithm will remain well defined when H 
is non-convex (such as Bethe-free-energy and other fractional 
entropy approximations) and exhibit the property that fixed 
points of the algorithm coincide with stationary points of 
eqn. |7] and (iii) the algorithm uses the graph structure inherent 
sparseness, i.e., is defined by a message-passing architecture 
on the underlying factor-graph. In other words, like the BP- 
algorithms, our scheme should be sending messages between 
variable and factor nodes of the factor graph. 

We will first take a detour and derive a general framework 
for minimizing problems of the type 

n 

mmf{h)+Y,h^{h) (8) 

i=l 

/(b) is a strictly convex, non-smooth, extended- valued func- 
tion of the type /(b) = /(b) + (5b (b) where / is essentially 
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smooth and S13 is the indicator function of the affine set 
B ^ {h : Ah = c}, namely, ^^(b) = if b e ;B and 00 
otherwise. The functions hi{h) are convex extended-valued 
functions (see Appendix |A] on mathematical background). 
In Appendix |B] we derive the following "primal-dual" block 
ascent algorithm which is guaranteed to converge to the global 
minimizer of eqn. [8] 

Algorithm 1 (Primal-Dual Ascent): Let /(b) — /(b) + 
5e(b) where /(b) is strictly convex, essentially smooth 
extended-valued function, and let hiih) be convex extended- 
valued functions. Initialize Ai = 0, A„ =0. 

1) Repeat until convergence: 

2) For i — 1, ...n: 

b) b* = argmin {/(b) + h,{h) + b^/xj 

h^dom{hi)ndom{f) 

c) A,; = — — V/(b*) + A^cr where cr is arbitrary. 
Output b*. 

The vectors A^ and /j.^ are messages passed along edges of 
a bipartite graph with n (function) nodes corresponding to the 
n functions /ii(b) and m (variable) nodes corresponding to the 
dimension of b. Function node i sends the m coordinates of 
vector Xi to the m variable nodes. Variable node j sends the 
j'th coordinate of vectors fii, /x„ to the n functions nodes. 
The algorithm iteratively optimizes with respect to the indexes 
i e {1, n}, stopping when it does not change the beliefs b*, 
thus the network proceeds in an almost cyclic update policy. 
The algorithm fits well with a graphical model architecture in 
the sense that if hi{h) depends only on a small subset N{i) 
of coordinates from b, then Xip — for every f3 ^ N{i) (and 
therefore need not be updated): 

Claim 1: Assume variables b are indexed by {1, m} and 
hiih) depends only on small subset of variables indexed by 
N{i) C {1, ...,to} and let A^ = {Xi ^}- Then, A^ ,3 — for 
all /3 ^ N{i). 

The claim and its proof can be found in Appendix |B] For those 
familiar with successive projection schemes, in the particular 
case when /(b) — /(b), i.e., is strictly convex and essentially 
smooth, and hi (b) — (b) (the indicator function of convex 
set Ci), the update step (b) for Algorithm [T] is a "Bregman" 
projection f6l of the vector /Xj onto the convex set C,;. In 
that case, following some algebraic manipulations (such as 
ehminating /i,, among other manipulations) the scheme (with 
A = 0) reduces to the well known Dykstra fT2| (also goes 
under different names such as Hildreth, Bregman, Csiszar, 
Han) successive projection algorithm which has its origins in 
the work of Von-Neumann fl3]| . Further historical details can 
be found in Appendix [B] 

Another useful property of the algorithm that it is well 
defined for non-convex primal energies. Specifically, we can 
establish the following result: 

Claim 2: Consider Algorithm [T] for Legendre-type function 
/(b) and non-convex continuously differentiable functions 
hi{h) restricted to the affine domain {b : Aih = c^}, and 
assume b* in step (c) is in the interior of dom{f). Then, fixed- 
points of the algorithm coincide with stationary points of the 



non-convex program /(b) + J^i hi{h). 

The proof can be found in Appendix |B-A The result states 
that when hi are non-convex but defined over an affine domain 
the algorithm is no longer convergent, but if it does converge 
it will do so to a stationary point of the optimization problem. 
This property of the algorithm extends the result of |69| about 
the behavior of the sum-product algorithm: if it converges, then 
it converges to a stationary point of the Bethe free-energy. 

The inference variational problem presented in eqn. [7] is 
embedded into the general template of eqn. [8] as follows: 



min/<:(b) 



i=l 



(9) 



where /^(b) = /^(b) + (5s (b) with S being the set of {b = 
{ha} : 6q,(xq,) e V} where V is the probability simplex 
(arrays that are non-negative and sum to one), and /^ is defined 
below: 

(^a (^a 

)-^ec,i/(b„) (10) 

Note that dom{f^) include all b e 5, i.e., /e(b) = 00 for 
h ^ S. The functions h^j are defined below: 

h,,,(h) = ^6),(a;O&«(a:O-ec«iy(bO-^ec.,(7?(b<,)-iy(b0), 

Xj cy.^N{i) 

(11) 

where dom{h^^i) is the affine set consisting of b^ for 
every a e N{i), which live in the probability simplex, i.e. 
doni{hg,i) C dom,{f(), and satisfy the marginal consistency 
constraints ^^(xq) — hi{xi). Note that b^ are not 

explicitly included in b = {b^}, but they are described by 
the values of which all bo, in the domain of h^^i agree upon. 

Given the sparse structure of hf i then, following Claim [T] 
we present the entries of A^ according to the factor-graph 
structure by setting A,; = {Aj ^(xa)} (and likewise /Xj). Step 
(b) of Algorithm [T| is reduced to finding b* for all a € N{i) 
and step (c) updates A^ „ by the rule: 

Xt.ai^a) = -IJ-i.ai^a) " V/e(6* (x^)) + (Jal, 

for an arbitrary CTq,. If instead of updating A^ „ we would 
update Tii^ai^a) = cxp(— Aj q(xq)) the additive degree of 
freedom inherent in the choice of cr turns into a scaling choice 

of rii^a- 

The derivation process required for embedding the defini- 
tions above into the primal dual Algorithm [T] is described in 
detail in Appendix [C] The resulting algorithm, we call norm- 
product, is presented in Fig. [T] 

Just as in the BP algorithms, the message ma^i{xi) from 
the factor node a to the variable node i is a vector over all 
possible states of Xi. The message Ui^ai^a) from the variable 
node i to the factor node a is an array over all possible states 
of Xq. The beliefs bi{xi), which are the approximations to the 
marginal probability p{xi) when e = 1, can be computed from 
the messages ma^i. 



h{Xi) CX I 4>^{Xi) Y\_ "^o 
a£N(i) 



(12) 
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Algorithm 2 (Norm-Product Belief Propagation): We are given nonnegative local evidence ((>i{xi), and nonnegative arrays 
ipa{y^a), where a C {1, n). Let Cia = + and = q + Y.aeN(i) ^a- 

1) Set n.i^ai'^a) = 1 for all i = 1, ...,n, a G iV(z) and x^. 

2) For t 1,2,... 

a) For i = 1, ...n do: 



/ / X l/(^ 

"^j—^a (Xq ) 

^^x„\x, \ jeN{a)\i 

( (l^ '^^iXi) n "^^4^2^^)^ ° / \ -c.„/c., 

^j^a (Xq ) 



/3GAf(i) 

771 



V 



jeN{a)\i 



Fig. 1. The norm-product belief propagation algorithm, where the messages mi^^i{xi) are computed with respect to the ii/^g^^ norm. For Cq = l,Ci = 
I — di, Ciry = it reduces to the belief propagation algorithms, sum-product when e = 1 and max-product when e = 0. Whenever Ca is the weighted number 



of spanning trees through edge a, and c, = 1 — 



Ca and Ci, 



■ it reduces to the tree-reweighted belief propagation algorithms (sum-TRBP and 



max-TRBP). Whenever Cq > 0, c^, c^q, > the norm-product is guaranteed to converge, and if also e > it converges to the global optimum of the program 
in eqn. ^ 



(13) 



where is defined in Fig. [T| The joint beliefs 6q,(xq,) can be 
computed from the messages rii^a- 

l/eco 

ieN(a) 

The norm-product algorithm includes the BP algorithms 
(sum-product and max-product), as well as sum-TRBP |62|, 
max-TRBP |l63l, and NMPLP IHl as particular cases. These 
algorithms relate to the simpler form of the norm-product al- 
gorithm, when — 0. In this setting the messages rii^ai^a) 
depend solely on the local potentials 4)i{xi) and the messages 
mp^i{xi). Therefore the messages ^^^^(xq) can be written 
in the compact form ni^a{xi), replacing x^ with Xi. In this 
case the norm-product algorithm in Fig. [T] takes the form: 



1/e 



E n 



(x„) 



jeN(a)\i 



/3eiV(i)\a 



When using the norm-product with the Bethe entropy approx- 
imation Cia — 0,Ca — i,Ci — I — di there holds Ci = 1 and 
the algorithm reduces to 

j^XaV^i \ jeJV(a)\i / 

Ui^aiXi) OC (pi{Xi) Y\ mp^i{Xi) 
l3eN{i)\c, 

which is the sum-product algorithm for e = 1 and the max- 
product algorithm for e = 0. 

When the factors corresponds to pairwise interactions a ~ 
the messages of norm-product algorithm ma^i and 



n. 



^a can be written by the shorthand notation rrij^i and 
The messages rrij^i of the norm-product algorithm 



in Fig. [T] depends on a single message 



rij^i and whenever 

Cia = the message Uj^i depends only on the messages 
nik-^j for every {k,j} £ which we abbreviate by 

k g N{j). Substituting the value of nj^i into rrij^i we obtain 
the pairwise norm-product, whose update rule consists only of 
the messages nik^j. When e ~ 1 the pairwise norm-product 
algorithm with Cia — takes the form 



(^Xi , Xj ) - 



m,:^y{xj) 



The sum-TRBP 11621 is a special case. The sum-TRBP sets Cij 
as the relative number of spanning trees of the graph which 



include the edge («, j), and sets Ci ~ 1 — J2jeN{i) ^ij- ^ 

result Ci — 1 and by substitution Mij{xi) rri'^_^i (xi) we 
obtain the sum-TRBP update rule as originally introduced in 
(Ilea, eqn. 39): 

M.,(.oocE^r(^^^-.) ^^^^^^"gTr^^^^^^ - 

When e = the pairwise norm-product algorithm with 



takes the form 



mj^i{xi) (X i<[ia:>c'ipij{xi,Xj)- 



ix,)Uk 



The max-TRBP |63| and NMPLP fTSl are special cases, de- 
rived as follows: With max-TRBP, we have and Ci defined 
by the tree-reweighted setting which results in = 1, and 
the Max-TRBP (|63|, eqn. 50) follows from the substitution 
Mij{xi) = ni^ji^l' (xi). The NMPLP is another recent max- 
product-like algorithm where messages ^ji{xi) are defined as 
follows: 



^jii^i) 



E 

keNU) 



Ikj i^j ) 
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Algorithm 3 (Sum-Product Belief Propagation type): We are given nonnegative local evidence (j)i{xi), and nonnegative 
arrays i/jai^a), where a C {1, n}. Let = + and = q + J^aeNii) ^a- 

1) Set rii^ai^a) = 1 for alH = l,...,n, a G iV(i) and x^. 

2) For t = 1,2,... 

a) For i = 1, ...n do: 



"^j—^a (Xq ) 

^^x„\x, \ jeN{a)\i 

( (l^ '^^iXi) n "^^4^2^^)^ ° / \ -c.„/c., 

'^j^a (Xq ) 



/3GAf(i) 

771 



V 



jeN{a)\i 



Fig. 2. Sum-product belief propagation type algoritlim, attained from tlie norm-product belief propagation when e = 1, where the messages are 
computed with the L^/g^^ norm. For = l,Ci = 1 — di,Cia = it reduces to the sum-product belief propagation algorithms, and whenever Ca is the 
weighted number of spanning trees through edge a, and = 1 — Y2aeN{i) ^'^ = it reduces to sum-TRBP. If > 0, c^, Cia > it reduces to 

the convex-sum-product algorithm, which is guaranteed to reach the global optimum of the convex free energy. 



where Wj = 2/((ij + 1). The pairwise norm-product message 
mj^i{xi) with the setting Cj = (1 — dj)/2 and Cij = 1 for 
every gives rise to Cij/cj — 2/{dj + 1). Thus with the 

def 

substitution ^ji{xi) = lnmj_j.j(a;j) and unit local potentials 
{4>i{xi) — 1) we obtain the NMPLP message above. 

The result of having the BP, TRBP and NMPLP algo- 
rithms arise as special cases of the norm-product algorithm 
underscores the generality of our derivation. However, the 
more interesting potential in the norm-product algorithm is 
the emergence of new message-passing schemes which are 
guaranteed to converge (unlike the BP and TRBP algorithms) 
corresponding to the setting of as a concave function 
(ca > 0, Ci, Cia > 0). Three classes of algorithms emerge: 

• The convex-sum-product corresponding to the setting 
e = 1 in the norm-product algorithm. The convex-sum- 
product is guaranteed to converge to the global optimum 
of the primal function eqn. |7] This includes the tree- 
reweighted free-energy in particular and other settings of 
convex-free-energy which are detailed in Appendix [P] 

• The approximate LP-relaxation corresponding to the set- 
ting e — > (but e > 0) in the norm-product algorithm. 
It provides an approximate solution to the LP-relaxation 
whose distance from the true solution is governed by an 
upper-bound we derive. The approximate LP-relaxation 
is guaranteed to converge to the global optimum of the 
primal function eqn. |7] 

• The convex-max-product corresponding to the setting 
e = in the norm-product algorithm. Unlike the max- 
product, the convex-max-product is convergence guaran- 
teed. However, there is no guarantee that the recovered 
solution corresponds to the desired LP-relaxation solu- 
tion. The advantage of convex-max-product is efficiency 
(introduced by instead of Li/^) and very good 
empirical performance. In fact, the convex-max-product 
is a convergent form of max-product. 

These message-passing algorithms, which are collectively re- 
ferred to as convex-BP algorithms, are discussed in the next 



section. 



IV. The Convex Belief Propagation Algorithms 

Eqn. [7] represents the free-energy approximation when e = 
1, the LP relaxation when e = 0, and a perturbation of the 
LP-relaxation for MAP estimation when e — > 0. When the 
entropy approximation term H is the Bethe approximation 
(setting Ca = ^,Ci = 1 — di,Cia = in eqn. |4]) the sum- 
product (e — 1) and max-product (e = 0) arise as special cases 
of the norm-product algorithm. Since in both cases the free- 
energy approximation is non-convex (for factor graphs with 
cycles) the convergence guarantees of those algorithms are 
weak. For the sum-product we have the guarantee that the 
algorithm convergence then it will reach a stationary point of 
the free-energy approximation (see Claim [2] and |69|). With 
the max-product we have weaker guarantees (Claim |2] does 
not apply because is not strictly convex when e = 0) where 
specifically, even if the algorithm does converge the marginal 
consistency constraints might not be satisfied. 

We focus now on the family of convex-free-energies which 
arise with the setting c^ > 0,Ci,Cia > 0. The convex-sum- 
product arises from the setting e = 1 is described next. 



A. Convex-sum-product Algorithm 

As a free-energy approximation (e = 1), eqn. |7] is strictly 
convex and, in turn, the norm-product algorithm is guaranteed 
to converge to the global optimum. We refer to the specializa- 
tion of the norm-product algorithm with Cq > 0,Ci,Cia > 
and e = 1 as convex-sum-product summarized in Fig. |2] 

The beliefs bi{xi), which are the approximations to the 
marginal probability p{xi), and the joint beliefs 6q(xq), which 
are the approximation to the marginal probability p(xq), are 
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computed from: 



aeN{i) 



OC 



jeN{a) 



Note that the algorithm has a much simpler form if Cm = 0. 
The message ni^ai^a) depends only on Xi and becomes: 



(14) 



The convex-sum-product is globally convergent for any con- 
cave setting of the entropy approximation H, i.e., when 
Ca > 0, Ci, Cia > 0. In particular, when the underlying factor- 
graph arises from a graph, i.e., the local interaction forms 
pairwise relations only, there is a setting that corresponds 
to TRW free-energy as described in Appendix |D] We also 
describe there additional parameter settings corresponding to 
other heuristic convex approximations of the entropy term H. 

We describe next the use of the norm-product algorithm as 
an approximation to the LP-relaxation for the MAP problem 
by taking e ^ 0. 

B. LP-relaxation Bounds 

For e > 0, let the global optimum of eqn. [7] (with concave 
H) denoted by and let the solution of the LP relaxation 
eqn. |6] denoted by b*. Let 6 stand for the concatenated 
functions Oi(xi) and ^^(Xq), i.e., 6^h — J2i x- ^1(3^1)^1(2^2) + 
J2ax (^a{xa)ba{^a)- We wish to uppcr-bound the difference 
9 — 6 h* < 5 where (5 is a function of e,Ca,Ci and Cia, 
described below: 

Proposition 1: Let Cq, > 0, c,;,Cm > describe a convex- 
free-energy eqn. |7] Let rii stand for the cardinality of a;, and 
ria = riigArfa) "^i ^c the Cardinality of x^. Then, 



< e^h, - e^h* < s, 



where 



(5 = e I ^ Cq In riQ + ^ Ci In + ^ ^ da In — 

! i i aeN{i) * 



Proof: The sets of beliefs b* , b^ are both in the local polytope 
L(G) whereas the beliefs b* are the optimal ones with respect 
to the original linear program eqn.|6] therefore 6^h* < ^^b^. 
On the other hand the beliefs b^ are optimal for the perturbed 
program eqn. [t] hence O^h, < O^h* + e(if(b,) - H{h*)) 
where H{h) is described in eqn. |4] 
Using Jensen's inequality we obtain: 



i?(b,) =^6,;(x,)ln— - <ln 



bijxi) 

'4^ h{x^) 



= Ini 



and likewise H{ha) < Inn.a- Substituting in eqn. |4]and noting 
that H{\3*) > we obtain: 

H{h,)^H{h*) < ^c«lnnQ + ^c,ln? 



E, ria 
C^a In . 
11 ■ 



D 

As a result, in the ideal world, one could generate the solu- 
tion bj arbitrarily close to the relaxed LP solution b*. There 
are, however, numerical accuracy limitations which in practice 
limit the size of e > ep > 0. The assumption in Proposition[T]is 
that the output b"-^- of the norm-product algorithm, as defined 
in eqns. 12|13 is equal to b^ the solution to the e-perturbed 
LP-relaxation eqn. [7] This is indeed true when e > but 
not when e = 0. As we shall see in more details in the next 
section, the norm-product algorithm is guaranteed to converge 
when e = but not necessarily to the minimal primal value. 
Therefore, from a numerical perspective there exists eq such 
that when e < eq the underlying assumption b" ^- = b^ 
ceases to hold. Moreover, the value of eq depends on the 
graph structure and the potential functions tpa and therefore 
is unlikely to have a simple and useful form. 

C. Convex-max-product Algorithm 

We saw that for the setting of e = and when H equals the 
Bethe entropy approximation then the norm-product becomes 
the max-product algorithm. We now explore the convex-free- 
energy setting Cq > 0,Ci,Cia > while e = and refer 
to the resulting family of algorithms as convex-max-product 
summarized in Fig. [3] 

Note that when Cia = we obtain a much simpler form of 
the algorithm where the message Ui^ai^a) depends only on 
Xi described in eqn. 14 



Algorithm 5 (Convex-Max-Product when Cia — 0): Repeat 
until convergence: 

1) For i — I, ...n and for all a e N{i) do: 



2a^i{Xi) = max < -ipai^a) TT n.j^a{Xj) 

I 

' jeN{a)\i 



.It 



n^^a{Xi) (X 



/3eAf(i) 



The desired output vector bi{xi) is recovered from computing 



the vector 



^^i)X\a<^N{i)^l%^^i) follows. If there 
are no ties, bi{xi) is determined by setting the highest value 
to 1 and all remaining entries to 0. If the highest value of the 
vector is shared among > 1 entries, i.e., there exist ties, then 
those entries receive the value l/r^. If there are no ties, i.e., 
= 1 for i = 1, n, then the result is the MAP solution. 
The setting e = raises two issues (i) if the algorithm 
converges, can one obtain from them the optimal LP-relaxation 
solution?, and (ii) is there a convergence guarantee of the 
convex-max-product family? The answer to the first question 
is generally negative. In a nutshell, the primal function f^^Q 
is convex but no longer strictly convex and therefore the dual 
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Algorithm 4 (Max-Product Belief Propagation type): We are given nonnegative local evidence (j)i{xi), and nonnegative 



eN{i) ' 



arrays jpai'^a), where a C {1, n}. Let Cia = + da and Ci 

1) Set rii^ai^a) = 1 for alH = l,...,n, a S N{i) and x^. 

2) For t = 1,2,... 

a) For i = 1, ...n do: 



Vxi Va e iV(i) ma^i{xi) = max < Vq(xq) TT nj„>a(xa) 

Xa\a:v I 

' j6A'(a)\i 



/3eJV(i) 



jeiV(a)\i 



Fig. 3. Max-product belief propagation type algorithm, attained from the norm-product belief propagation when e = 0, where the messages ma^i{xi) are 
computed with the L^c norm. For Ca = I, Ci = I — di, Cia = it reduces to the max-product belief propagation algorithms. Whenever Ca is the weighted 



number of spanning trees through edge a, and = 1 — 



6iV(i) ' 



and Cia = it reduces to max-TRBP. For Cq = l,Ci = (1 — di)/2,Cia = it 
reduces to the NMPLP algorithm. If Ca > 0, c;, Ci„ > it reduces to the convex-max-product algorithm, which is a convergent max-product type algorithm 
for LP-relaxations. 



function is no longer differentiable. A dual ascent approach on 
a non-differentiable dual function can get stuck at "corners". 
The implication of getting stuck at a corner of the energy 
landscape is that the recovered primal solution b^^o might 
not correspond to the lowest primal energy and furthermore 
might not satisfy the marginal consistency constraints. More 
details can be found in Appendix |B-B| 

We consider now the the second question of whether the 
dual ascent creates a converging sequence? The answer is 
positive, i.e., the convex-max-product algorithm is convergent 
(unlike max-product on general graphs). 

Theorem 1 ( Convergence, Convex-max-product): The 
norm-product algorithm with the parameter setting of e = 
and Ca > 0, Ci, > is convergent. 

Proof: Let (/^(Ai, A„) represent the conjugate dual 
eqn. 



21 



g,(Ai, A„) = -/:(- - E KA^^)' 



i=i 



and let <7o(-^i, -^n) be the limit of as e — ^ 0. The explicit 
form of the conjugate duals /* and h* ^ are: 



KiW=^'^ (t>^i.Xi) Jl ||^^^^^exp(A„(x„))||i/, 



(15) 
(16) 

l/eci 



where \\^^\^^z{yLa)\\l = Y.^^\x, \za{y^a)Y'. The functions 
/o fe^o ^"'l '^o i K^o I well defined and thus. 



is well defined as well. By definition of the block ascent 
scheme, let A^ ^ e argmax_^ ^^(Ai, A„). We note that 
Ao,i — limg^^o ^e.i is well defined because e appears as a 
norm in the definition of the message ni^a- 



We use the shorthand q^{\^^i) instead of 
(7<:(Ai, Ai_i, Ae,i, Ai+i, A„). We wish to show that 
Ao,^ e argmaX;^^(7o(-^i, A„). 

Assume to the contrary that Ao,i ^ argmax_^.(7o(') and let in- 
stead Ao,i e argmaX;^^qo(-)' thus making qa{\o^i) > go(Ao,i)- 
Since qo — limg^o 9c> there exists cq such that for all e < eo 
we have qe{Xo.i) > gol-^o.i) as well. Likewise, using the 
limit argument on the right-hand side, <Ze(Ao.i) > (?e(Ao_i). 
Finally, since Ao,i = limg-yo ^d, and q^ is continuous, we 
have (7e(Ao,i) > qe{\e,i) which contradicts the fact that 
A^,,; g argmax;^.g£(-). [] 

We conclude that the convex-max-product, unlike max- 
product, is convergence guaranteed, since it iteratively im- 
proves the dual objective which is bounded by the primal 
objective. The convex max-product is guaranteed to recover 
the MAP assignment if its beliefs are integral. However, in 
many cases we can use the rounding scheme for the max- 
product type algorithms which guarantees the MAP if the 
beliefs recovered from the messages are without ties 1651 . 

V. Experiments 

In our experiments we first evaluated the quality of the 
max-product type algorithms for solving a linear program with 
pairwise interactions and binary variables 



mm 



y , 9iixi)bi{x^)-\- 

i,a;i6{0,l} 



The max-product type algorithms differ from each other by 
their approximated entropy coefficients Ca,Ci,Cia, but since 
the linear program has no entropy terms, all these algorithms 
aim at producing the same result. We distinguish between three 
families of max-product type algorithms: 

• The first family corresponds to non-concave entropy 
approximation, such as the Bethe free energy whose 



coefficients Cn 



0,Ci 



1 ~ di and Ci, 



produce 



the max-product algorithm. These algorithms are not 
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guaranteed to converge and even if they converge there 
are no guarantees on their solution. 

• The second family corresponds to concave entropy ap- 
proximations with positive Cq., negative c; and Cia = 0. 
The notable member of this family is the max-TRBP 
algorithm |63|, whose is the weighted number of 
spanning trees which pass through the edge a and c; — 
1 — J2aeN{i) '^a- Thcsc max-product type algorithms are 
not guaranteed to converge, but whenever they converge 
one can extract an optimal solution for a pairwise linear 
program with binary variables, cf. fST) theorem 4 and 
1381 corollary 2. 

• The third family corresponds to concave entropy ap- 
proximation with Ca,Ci,Cia > 0. Thcsc convex-max- 
product algorithms are guaranteed to converge to the 
global optimum for a pairwise linear program with binary 
variables, cf. |38| corollary 2 and (TE] proposition 3. 

We used the implementation of the max-product type algo- 
rithm described in Algorithm]?] while each algorithm differs in 
its appropriate Ca,Ci,Cia- To evaluate the performance of the 
algorithms we generated 100 samples of 10 x 10 grids, where 
di and 9i,j were sampled from zero mean Gaussians with 
standard deviation of one. We set the local evidence according 
to 9i{xi) — 6i{~\Y\ and for the pairwise interactions 
6ij{xi,Xj) we set the value 9ij on their diagonal and —Oij 
on their off-diagonal. 

First we investigated the convergence properties of three 
representatives of the max-product families described above: 
The max-product algorithm, the max-TRBP described in 1*631, 
and the convex max-product with the same tree-reweighted 
free energy, represented by Ca, Ci, Cia > as described in Ap- 
pendix |D] The convergence criterion for the max-product and 
max-TRBP algorithms was measured with respect to change 
in their messages, whereas the convergence criterion for the 
convex-max-product was measured with respect to change 
in its dual function. The max-product algorithm converged 
for 25% of the runs, the max-TRBP converged for 90% of 
the runs, and as expected from Theorem [T| the convex-max- 
product always converged. However, the convex max-product 
was slower than max-TRBP, while we measured the primal 
values obtained by both algorithms during their runs. Over 
the runs the max-TRBP converged in average number of 430 
iterations compared to an average of 6400 of the convex-max- 
product with tree-reweighted parameters. 

Next we compared the run-time of three representatives of 
the converging max-product: The convex-max-product with 
tree-reweighted free energy, the NMPLP of 11 Si and the 
convex-max-product with = l,Ci = Q,Cia = which 
was referred as "trivial convex-max-product" by ll65l . We 
measured their convergence with respect to the change in their 
dual objective: The NMPLP converged in average number of 
200 iterations, the trivial convex-max-product converged in 
average of 260 iterations, and the convex-max-product with 
tree-reweighted free energy converged in average of 6400 
iterations. 

To conclude, for linear programs with pairwise interactions 
and binary variables the convex-max-product algorithms im- 
prove upon previous max-product type algorithms: They are 



guaranteed to converge to the global optimum. However the 
convex-max-product algorithms differ from each other in their 
memory requirements and run-time. Among those algorithms, 
the ones with Cia = requires less memory, as their messages 
fii^a depend only on Xi, and have a faster run-time. 

The norm-product family of algorithms can also solve linear 
program using the perturbation method for a small value of 
e, as described in Proposition [T] However the convex-max- 
product algorithms are computationally more efficient, and 
guaranteed converge to the global optimum of linear program 
with pairwise interactions and binary variables. Therefore we 
evaluate the convex-norm-product type algorithms over linear 
programs with non-binary variables 



mm 

&.,fc,,jeL(G) 



E 

i,a:ie{l,2,3} 



.i{xi)bi{xi) + ^ 



{i,j)£E,Xi,Xj£{l,2.Z} 



For these programs the convex-norm-product algorithms are 
guaranteed to converge to the global optimum, whereas the 
convex-max-product can converge to non-optimal stationary 
point. To evaluate the performance of the convex-norm- 
product we generated 100 samples of 10 x 10 grid where 
6i{xi) and dij were sampled from zero mean Gaussians with 
standard deviation of one, and dij{xi, Xj) were given the value 
dij on their diagonal and —Oij on their off-diagonal. 

We measured how often the convex-max-product algorithm 
converges to non-optimal stationary points, comparing to the 
convex-norm-product which always achieves its optimum as 
described in Claim [8] To indicate these events we compared 
the dual value of the linear program, which is evaluated by the 
convex-max-product stationary messages and by the convex- 
norm-product messages, setting e — 0.001 and — l,Ci = 
0, Cia = 0. For 60% of the runs, the dual values attained by 
the convex-max-product and the convex-norm-product were 
0.01 close to each other, indicating both algorithms reached 
the maximal dual value. On the other hand, for 25% of the 
runs the dual value of the linear program attained by the 
convex-max-product messages was 0.1 lower than the one 
attained by the convex-norm-product messages, indicating the 
convex-max-product reached a non-maximal dual value. This 
fact has important practical implications: Only from the dual 
optimal solution one can recover the optimal beliefs that solve 
the primal linear program, while non-optimal dual messages 
always relate to non-consistent beliefs. In particular for the 
25% of the runs the convex-max-product did not produce 
beliefs which agree on their marginal probabilities, whereas 
the convex-norm-product always recover beliefs which satisfy 
the primal linear program constraints. 

In our experiments we also evaluated the sum-product type 
algorithms for approximating the marginal probabilities of 
distribution p(x) of the form 

p(x) (X exp ^ 9,{Xi) + ^ Oa{yia) 
The variational framework for approximating marginal prob- 



abilities, described in Section II-A suggests that the approx- 
imated entropy term affects the quality of the approximated 
marginal probabilities. Although we do not have a theoretical 
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guarantee for setting the best approximation, in these exper- 
iments we show how the different approximations behave in 
practice. We consider two types of free energy approximations: 

• Non-convex free energy approximations, represented by 
the Bethe approximation which corresponds to — 
1, Ci = 1 — di, = 0- The sum-product algorithm aims 
at finding a local minimum for the Bethe free energy 
approximation, but it is not guaranteed to converge. In 
cases where it does not converge we used the double 
loop algorithm ll22l in libDAI ll40l . which is guaranteed 
to converge to a stationary point of the Bethe free energy. 

• Free energy approximations which are convex in the 
intersection of the marginalization constraints. These ap- 
proximations are appealing since their stationary points 
are their global minimum. We address the tree-reweighted 
free energy approximations whose Ca,Ci,Cia correspond 
to spanning trees in the graph, and also to L2 convex free 
energy approximation heuristic described in Appendix [P] 
We note that whenever Ca,Ci, Cia > the corresponding 
convex-sum-product algorithms are guaranteed to con- 
verge to the global optimum. 

We used the implementation of the sum-product type algo- 
rithm described in Algorithm [3] while each algorithm differs 
in its appropriate Ca,Ci,Cia- Following |62| We generated 
100 samples of 10 x 10 grids with binary variables Xi G 
{0, 1}, where 9i were uniformly chosen from the interval 
[—0.05, 0.05], and Oij were either chosen uniformly from the 
attractive interval [Q,uj] or the mixed interval [— a;,a;]. We ran 
the simulations with edge strength oj ranging from to 2. 
We set the local evidence to 9i{xi) = 9i{—l)^\ and for the 
pairwise interactions 9i j{xi, Xj) we set the value 9i j on their 
diagonal and —9ij on their off-diagonal. 

We compared to true marginal probabilities with the ap- 
proximated marginal probabilities recovered from the Bethe 
free energy approximation, tree-rewighted free energy ap- 
proximation, and the L2 convex-free-energy heuristic. Fig. 
[4] shows the average Li error in the marginal probabilities 

We conclude from this experiment that the convex ap- 
proximations are better than the Bethe approximation for the 
attractive settings, when 9ij > 0. However, the Bethe approx- 
imation is slightly better in the mixed settings for u < 1 and 
considerably worse for w > 1. Moreover, in the mixed settings 
the sum-product did not converge for a; > 1 and we used 
the double-loop algorithm instead which is computationally 
more expensive. We also conclude that the L2 convex free 
energy settings produce comparable results to tree-rewiehted 
free energy for grids. 

We also compared the tree-reweighted and L2 convex free 
energy approximated marginal probabilities on the complete 
graph, i.e. every two vertices are connected with an edge. We 
generated 100 samples of complete graphs with 10 vertices 
with binary variables, where 9i were uniformly chosen from 
the interval [—0.05, 0.05], and 9ij were chosen uniformly from 
the interval [0,w], for w ranging from to 2. Fig. |5] shows 
the average Li error in marginal probability, suggesting that 
in the case of complete graph, whose structure is far from 
a tree, the L2 convex approximation heuristic is better than 



Field=0.05, Attractive 




0.5 1 1.5 

Interaction Strengtli 

Field=0.05, Mixed 




0.5 1 1.5 

Interaction Strength 

Fig. 4. Comparison of error in marginal probabilities, estimated by 
Bethe free energy, tree-reweighted free energy and L2 convex free en- 
ergy described in Appendix^\ We computed the Bethe approximation 
by applying the sum-product when converged, and the double-loop 
algorithm otherwise. The other free energy approximations are convex 
and the convex-sum-product algorithm is guaranteed to converge to 
their optimum. The graphs present the average error over 100 random 
trials 

Field=0.05, Attractive 




Interaction Strength 

Fig. 5. Comparison of error in marginal probabilities on a complete 
graph, estimated by tree-reweighted free energy approximation and 
L2 convex free energy approximation. The graphs present the average 
error over 100 random trials 



the tree-reweighted approximation for marginal probabilities 
estimation. 

Generally, the same convex free energy can be represented 
by different coefficients CQ,Ci,CiQ. In particular, the tree- 
reweighted free energy can be described by positive c^, which 
correspond to the weighted number of spanning trees that go 
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runtime is seconds, Field=0.05, Attractive. 
350, 




Fig. 6. Run-time (in seconds) comparisons of convex-sum-product 
against a conditional gradient descent solver (running on convex- 
Z/2 free energy). The algorithms were applied to n x n grids with 
n = 2, 3..., 10. Mean is shown for 10 random trials. 

through the edges a, and negative Ci — 1 ~ J2aeN{i) '^a and 
Cia = 0. However, the same tree-rewieghted free energy can 
be represented by c^, c^, Cia > 0, as explained in Appendix [P] 
These representations affect their corresponding sum-product 
type algorithms: The first representation corresponds to the 
sum-TRBP algorithm which is not guaranteed to converge, 
whereas the second representation corresponds to the convex- 
sum-product which is guaranteed to converge. However, the 
convex-sum-product was slower than sum-TRBP, while we 
measured the primal values obtained by both algorithms during 
their runs. Similar results were reported in (TT\. 

Fig. |6]compares the running time of the convex-sum-product 
algorithm with a general convex solver performing conditional 
gradient descent on the primal energy function |4] which uses 
linear programming to find feasible search directions. We ran 
the algorithms on n x n grids where n = 2,3, ...,10. The 
stopping criteria for all algorithms was the same and based on 
a primal energy difference of 10^^. For a 10 x 10 grid, for 
instance, the general convex solver was slower by a factor of 
20 (e.g., 306 seconds compared to 15.2). For a 2 x 2 grid, 
on the other hand, convex-sum-product took 0.15 seconds 
compared 1.41 seconds for the general convex solver. We 
conclude that the sum-product type algorithms converge faster 
than a general convex solver, since they exploit the structure 
of the graph. 

VI. Discussion 

We have presented a single unified message-passing frame- 
work for approximate inference covering both marginal prob- 
abilities estimation and the MAP assignment problem through 
LP-relaxation. We took a general perspective on the existing 
BP and TRBP algorithms and noted that all are reductions 
from the basic optimization formula of / + where the 

function / is an extended-valued, strictly convex but non- 
smooth and the functions hi are extended-valued functions 
(not necessarily convex). We used tools from convex duality 
to present the "primal-dual ascent" algorithm which is an ex- 
tension of the Bregman successive projection scheme. Most of 
the details of this part of the paper was pushed to Appendix [B] 



in order to reduce the overall technical load for the main-body 
presentation. 

We then mapped the fractional-free-energy variational prin- 
cipal for approximate inference onto the optimization struc- 
ture / + hi and introduced the "norm-product" message- 
passing algorithm. Special cases of the norm-product include 
sum-product and max-product (BP algorithms), TRBP and 
NMPLP algorithms. When the fractional-free-energy is set to 
be convex (convex-free-energy) the norm-product is globally 
convergent for the estimation of marginal probabilities (the 
convex-sum-product branch corresponding to e = 1) and 
for approximating the LP-relaxation (e — > 0). We have also 
introduced another branch of the norm-product which arises 
as the "zero-temperature" of the convex-free-energy (e — 0) 
which we referred to as the convex-max-product. The convex- 
max-product is a convergent solver to the LP-relaxation (un- 
like max-product) but is not guaranteed to reach the global 
optimum. 

As a general statement, the convex-free-energies provide a 
way for obtaining approximate inference over general graphs. 
There are two main issues in this regard: the first is how 
to obtain a guaranteed globally convergent message-passing 
algorithm for the general class of convex free energies, and 
secondly, how to tune the energy parameters Cj , , to a 
specific graph? 

As for the first issue, we have provided a complete treatment 
which also encompasses the existing BP and TRBP algorithms 
(though they do not arise from a convex-free-energy but 
from a non-convex fractional-free-energy). As for the second 
issue, we provided a simple algorithm for converting the 
conventional TRW-free-energy settings to the convex-free- 
energy framework and have also proposed a heuristic principle 
where among all admissible parameters we choose the one 
most closest to the Bethe free energy (Appendix [D|. Empirical 
results show that for certain graphs, like a grid, we obtain very 
close marginal probability estimation results to those obtained 
by the TRW free energy. For complete graphs we obtain a 
very different free energy from TRW and superior accuracy 
of marginal probability estimation. The results suggest that our 
heuristic for setting up the convex free energy satisfies what 
we were after, i.e., to get approximations similar to BP but in 
guaranteed (globally) convergent framework. 

In this work we limited the scope to factor graphs where 
the neighborhoods of every pair of factor nodes have at most a 
single intersection to give a clear description of the mathemat- 
ical details presented in this work. However, the techniques 
presented here can also be used as a basis to a convex 
and non-convex generalized belief propagation [69 1. Different 
algorithms were recently developed for tightening the LP- 
relaxation f5\\, (531, (321 using intersections of increasingly 
larger clusters in order to recover the MAP assignment. We 
believe similar techniques can be applied to convex free 
energies in order to tighten the bound on the log-partition 
function. 

We did not discuss the parallel implementation of the norm- 
product algorithm, but as every message-passing algorithm it 
can be parallelized: One can distribute to the different parallel 
units an independent set of vertices, i.e. vertices which are 



PRESENTED IN PART AT THE CONFERENCE ON UNCERTAINTY IN ARTIFICIAL INTELLIGENCE (UAI), JULY 2008. 



14 



not connected to each other in the graph. This mechanism 
preserves the convergence and optimal guarantees of the 
algorithm. The norm-product can also be made fully parallel, 
as it is a generaUzation of the belief propagation algorithm, 
but in this case convergence is no longer guaranteed. This can 
be fixed by methods described in |19|. 

The convergence rate and the complexity analysis of the 
norm-product algorithm were not addressed in this work. Since 
the convex norm-product algorithm performs a dual block 
ascent it has a linear convergence rate, whenever e,Ca,Ci > 
0,Cia = (cf. |36| Theorem 5.1), i.e. it achieves a (5-optimal 
solution in 0(log(l/(5)) steps. However, this notation does not 
capture the true complexity of the algorithm as 0(log(l/(5)) 
depends on unknown constants which can be very large. 
For this purpose complexity bound were recently introduced, 
where it was proved that the dual gradient ascent attains linear 
complexity, (cf |42| Theorem 2.1.13, |57| Theorem 5.1). 
Although the convex norm-product can be modified to achieve 
linear complexity its step size depends on e,Ca,Ci and the 
modified algorithm is inefficient compared to the convex norm- 
product. We believe this due to the fact that the convex norm- 
product finds the optimal dual assignment in each step, 
unlike the gradient methods. Generally, a complexity bound for 
block coordinate ascent algorithms such as the convex norm- 
product is an open problem. 

Future work is also required for obtaining a firmer theo- 
retical understanding about how to set the concave entropy 
approximation, in order to guarantee a good approximation for 
the marginal probabilities. For example, how tight is the TRW- 
entropy bound, and whether one can find a family of trees 
which guarantees the best bound? Clearly, these theoretical 
guarantees must consider the potentials functions, since for 
every graph its TRW-entropy can be made arbitrary close to 
the true entropy for some potentials. 

Appendix A 

Mathematical Background on Conjugate Duality 

We consider the n-dimensional Euclidean space R" and 
denote vectors in bold face, e.g. x £ R". We start with a 
brief review of basic concepts of sets. A set 5" is said to be 
closed if every of its limit points is contained the set. A set S is 
called open if its complement i?" \S is closed. The interior of 
a set S, denoted by int{S), is the largest open set contained 
in S. The closure of a set, cl{S), is the smallest closed set 
containing S. A point x is a boundary point of 5 if x e cl{S) 
and X ^ int{S) or equivalently if every neighborhood of x 
contains at least one point of S and at least one point not of 
S. A set C is called convex if it contains the line-segment 
between any two points x and y in the set. That is, for every 
< A < 1 the point Ax + (1 - A)y e C. 

For our purposes, since we deal with low-dimensional sets 
placed in higher-dimensional spaces, we use the concept of 
relative interior denoted by ri{S) which, defined intuitively, 
contains all points which are not on the "edge" of the set, 
relative to the smallest affine subspace in which this set lies. 
For example, for a convex set C, x g ri{C) if and only if 
Vy G C there exists z e C and < A < 1 such that x — 
Az + (1-A)y. 



The graph of a function /(x) is the curve {(x, /(x)) : x G 
i?"}, and define the epigraph of a function /(x), denoted 
by epi{f), as the set above its graph, namely {(x, r) : x G 
R"',''' > /(x)}. A functions is called closed if its epigraph is 
a closed set. A function is said to be convex if its epigraph is 
a convex set. A function is called strictly convex if any line 
segment in its epigraph intersects with its relative interior. A 
twice differentiable function is convex if its matrix of second 
derivatives, called the Hessian, is positive semidefinite, and 
strictly convex if its Hessian is positive definite. 

In this paper we work with functions that can take the value 
of infinity and as such are non-differentiable. Such functions 
are known as extended-valued: 

Definition 2 (Extended-Valued, Proper): A function /(x) 
is said to be extended real-valued if — oo < /(x) < oo. 
The effective domain of /(x) is denoted by dom{f) ~ 
{x : /(x) < oo}. A function is said to be proper if 
— oo < /(x) < oo, and it obtains at least one finite value. 

Proper functions typically arise when constraints are embed- 
ded into finite valued functions. For example, the indica- 
tor function associated with a convex set C is defined by 
(5(7 (x) — when x G C and Sc{x) — oo otherwise. A 
possible use of the indicator function is to constrain a finite 
valued function / with the set convex set S to define a proper 
function / = f + Ss- We define next the type of smoothness 
used throughout this paper: 

Definition 3 (Essentially Smooth): Let / be a proper and 
convex function differentiable throughout the non-empty set 
C = int{dom{f)). Then / is called essentially smooth if 
liiTLfc^oo ||V/(xfe)|| = oo whenever Xfc is a sequence in C 
converging to a boundary point x in C. 

Necessary and sufficient conditions for a function to be 
essentially smooth are described in the following theorem: 

Theorem 2 (Legendre type): A closed and proper convex 
function /(x) is essentially smooth if and only if it is differ- 
ential in its interior C — int{dom{f)), i.e. 9/(x) = V/(x) 
for every x G C, while 9/(x) — when x ^ C. If /(x) 
is also strictly convex on C it is called a convex function of 
Legendre type, and its gradient mapping V J : C i?" is 
continuous and one-to-one, and V/* = (V/)^^. 

Proof: L46J, Theorem 26.1 and Theorem 26.5 [] 

The sets {x : a^x > 6} and {x : a^x < 6}, are called the 
closed half-spaces associated with the hyperplane {x : a^x = 
&}. We say that two sets Ci, C2 are separated by a hyperplane 
if each set lies in a different closed halfspace associated with 
the hyperplane. If a vector x is a boundary point of a set 
C, then a hyperplane that contains the singleton {x} and one 
of its halfspaces contains C is said to be supporting C at 
X. In other words, a supporting hyperplane is a hyperplane 
that "just touches" the set C. If C is a convex set then there 
exists a supporting hyperplane for every point on its boundary. 
Supporting hyperplanes play a role in the definition of the sub- 
gradient of a non-differentiable function. A vector A is called 
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a subgmdient of a convex proper function / at x if 

Vz /(z) >/(x) + A^(z-x). 



(17) 



This condition has a simple geometric meaning: it says that 
the affine function ft,(z) = /(x) + A^(z — x) is a (non- 
vertical) supporting hyperplane to the convex set epi(f) at the 
point (x, /(x)). Consequently, the set of subgradients A at x, 
called the subdijferential of f at and is denoted by 9/(x), 
consists of the supporting hyperplanes to the convex set epi(f) 
at the point (x, /(x)). When / is differentiable at x then the 
supporting hyperplane is unique and 9/(x) = V/(x). 

Definition 4: The sub-differential of a function / at a point 
X is denoted by 9/(x) and consists of all the supporting 
hyperplanes of epi(f) at the point x, namely 

a/(x) = {A:Vz /(z)>/(x)+A^(z-x)} 

The following claim describes the sub-differential of the 
indicator function associated with affine sets (a useful result 
which will serve us later): 

Claim 3: Let Ahe k xn matrix and consider the affine set 
B ^ {x : Ax = c} and its indicator function 

Ax^c 

oo otherwise 



<5b(x) = 



Then dSe = {A^cr : cr e R''}. 

Proof: This claim results as a special case of |4| example 
7.1.4. For the sake of clarity we provide a direct proof. We 
describe the sub-differential 9(5e(x) for every point x in the 
domain of Sjg, i.e., Sb{x) — 0. To prove the direction dSjg I) 
{A^cr : cr £ R'^} we must show that (5b(z) > Ssix) + 
(T^j4(z — x) for every z. For every z satisfying Az = c this 
relation holds since (5e(z) = and A{z — x) = 0. For every 
z with Az c this relation holds since (5e(z) — oo. 

To prove the other direction dSs C {A^ cr : cr £ R''} we 
must show that Sb{z) > Sb{x) + {z ~ x) only if A = A'^cr 
holds for every z. First we note that the set {z — x : Az — c} 
is orthogonal to {A^ a : cr £ R'^}, therefore if we assume 
on the contrary that A ^ A^ cr there must be a vector zq — x 
with non-vanishing angle with A, namely A^(zo — x) > 
therefore Definition |4] does not hold for A. Q 

Claim 4: Consider a function / whose domain is contained 
in the affine set B ^ {x : Ax = c}. Then whenever A £ 
df{x) there holds (A + A^ cr) £ df{x) for every cr. 
Proof: /(x) can be equivalently written as /(x)+(5e(x) where 
6b is the indicator functions of the affine set B, therefore df — 
d{f + Sjg). From the linearity of the sub-differential, cf. Q 
Theorem 4.2.4, there holds a(/(x)+(5e(x)) = df(x)+dSBix) 
and the claim follows since A £ df{x) by assumptions, and 
A^cr e dSeix) from Claim |3] [] 

A supporting hyperplane at x with A-slope must satisfy 
Definition |4] namely 

Vz /(z) - A^z > /(x) - A^x, 

therefore it must hold that the A-slope hyperplane supports the 
epigraph at (x, /(x)) where x £ argmin{/(z) — A^z}. This 
leads to the definition of the conjugate function: 



Definition 5: The Fenchel-Legendre conjugate is: 
r(A)= max {(A^x-/(x)}. 

The conjugate /*(A) describes the offset of the A- 
hyperplane that supports the epigraph of /. Note that regard- 
less of the structure of /(x) its conjugate function /*(A) is 
closed and convex, since it is the pointwise maximum of a 
collection of affine (closed) functions. Furthermore, if / is 
convex then the conjugate of its conjugate returns back /, i.e., 
/** = / (cf |4|, Theorem 7.1.1). The following claim is a 
useful result which shall serve us later: 

Claim 5: Let .g*(A) = /*(A — //), then ^(x) = f{x) + fjJx 

Proof: The definition of the Fenchel-Legendre conjugate of 
g(x) — /(x) + fjJx takes the form .g*(A) — max^^jA^x — 
/jJx — /(x)} which has the form in the claim since A^x — 

M^x - (A - f,yx n 

The convex conjugate plays an important role in duality. 
Consider contrained minimization under linear constraints 
^x ~ 0, i.e., aj'^x = 0, a^x = with being the i'th row 
vector of A. The statement about the existence of Lagrange 
multiplier for non-differentiable functions is described next: 

Theorem 3: (Lagrange multipliers) 

Let /(x) be a proper convex function and consider the convex 
program 

min /(x) subject to Ax = 0. 

xGdom(/) 

Assume ri{dom{f)) intersect the linear constraints Ax — 
and that the optimal value of the program is finite. Then there 
exists Lagrange multipliers A|,...,A^ satisftying 



X* £ argmin {/(x) 



E 



A* a J 



(18) 



Proof: The assumptions 6.4.1 in H hold in this case and 
following the Nonlinear Farkas lemma, as done in Theorem 
6.4.2 in li4J, completes the proof. Q 

The duality theorem using the conjugate /* is described 
below: 

Theorem 4: (Strong Duality) Let / be a convex proper 
function and ri{dom{f)) intersects with the constraints Ax = 
0, and that the optimal value of the program is finite. The 
following form a primal-dual pair: 



(primal) min /(x) s.t. Ax 

(dual) max — /*(— A^A) 

A=Ai,...,Afc 







(19) 
(20) 



Then there is no duality gap and there exists primal-dual 
optimal pair. Moreover, the vectors (x* , A* ) form a primal- 
dual optimal pair /(x*) = -/*(— A^A*) if and only if the 
following "algorithmic certificate" for optimality hold: 



X* £ dom{f) 
^£d{f{x*)+A^X*} 



(feasibility) 
(optimality) 



Proof: The existence of primal-dual optimal pair follows from 
Theorem [3] The rest follows from (|4l. Theorem 6.2.5 [] 
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Note that due to the hnearity of the sub-differential d{f + 
g) = df + dg, the optimality condition above is equivalent to 
-A^X* e 9/(x*). 

To see the connection to Lagrangian duality, note that by 
definition of /* we have: 

-f*{-A^X*)= min {/(x)+A^A*}, 

xG(iom{/) 

which in turn means that the primal-dual pair (x* , A* ) satisfy 
X* e aigmin^{f{x) + A^X*} where the right-hand side is the 
Lagrangian i(x, A) = /(x) + X and the dual problem is 
maxA q{X) where g(A) = miiix i(x, A). 

A proper convex function /(x) is essentially strictly convex 
if it is strictly convex on every convex subset in dom{df). We 
note below that in order for the dual function to be smooth 
the primal must be strictly convex. A smooth dual is necessary 
for a dual ascent scheme (described later). 

Theorem 5: (strict primal <S=> smooth dual) 

A closed proper convex function is essentially strictly convex 
if and only if its conjugate it essentially smooth. 

Proof: L46J, Theorem 26.3 [] 

We describe below two Fenchel duality theorems which 
are the functional form of the Lagrange duality where the 
constraints are implicit in the functions domains: 

Theorem 6: Basic Fenchel DuaUty I 

Let (7(x),/i(x) be proper closed and convex functions and 
ri{dom{g)) n ri{dom{h)) ^ 0, and the value of the program 
is finite. The following are primal and dual programs: 

Primal: mmg{x) + h{x.) 

X 

Dual: max-g*(-A) - /i*(A) 

Then there is no duality gap, and there exists primal-dual 
optimal pair. Moreover, the vectors (x* , A* ) are primal-dual 
optimal pair if and only if —A* G dg{x*) and A* e dh{x*). 
Conversely, by reversing the roles of primal and dual, the 
vectors (x* , A* ) are primal-dual optimal pair if and only if 
X* e 9g*(— A*) and x* e dh*{X*). In particular, if g(x) is 
essentially strictly convex and g* (A) is finite, then the optimal 
X* is determined by x* e Vg*{—X*). 

Proof: We reduce Fenchel duality to Lagrange duaUty in The- 
orem]?] where we consider a decomposed version of the primal 
function f{xg,Xh) ~ 5(xg) + h{xfi) subject to the linear 
consistency constraints Xg = x/j. Note that the vector equality 
constraint is composed from rn equality constraints where m 
is the length of the vectors Xg and Xh, therefore we expect to 
use Lagrange multipliers vector A of length n. The Lagrangian 
L(xg,Xh, X) takes the form ,g(xg) + /i(x/i) + A^(xg — x/i) and 
using the conjugate notation in Definition. ]5]the dual function 
q{X) — miiixg.xft i() takes the form in the theorem above. 
Following Theorem ]4] there exists primal-dual optimal pair 
which must satisfy the feasibility condition, i.e. x* ~ x,*, 
and the optimality condition, namely —A* e dg{x*g) and 
A* e dh{x*j^). The theorem follows as the optimal x* must 
equal x* as well as xj^. Reversing the roles of primal and 
dual are allowed by convexity whereby g** — g, h** — h. 
Furthermore, since g*{X) is finite Theorem ]5] determines 



g*{X) to be smooth, and whenever x* e dg*{—X*) there 
must hold X* € V.g*(-A*). [] 

The next theorem is generalizes the Fenchel duality theorem 
above: 

Theorem 7: Basic Fenchel Duality II 

Let f{x),hi{x),...,hn{x) be proper, closed and convex 
functions and ri{dom{f)) r\ri{dom{hi)) ^ and the optimal 
value of the program is finite. The following are primal and 
dual programs: 

n 

Primal: min /(x) + \^ hi{x) 

X ^ — ^ 

4=1 

n n 

Dual: max{-r(-^A,)-^;i*(A)} (21) 

i=l i=l 

Then there is no duality gap, and there exists prima-dual 
optimal pair. Moreover, the vectors (x* , A* ) are primal-dual 
optimal pair if and only if — X^iLi ^ 9f{x*) and A* e 
dhi{x*). Also, if /(x) is essentially strictly convex and /*(A) 
is finite, then x* = V/*(- Y.i K)- 

Proof: The proof closely follows the one of Theorem |6] where 
we consider a decomposed version of the primal function 
/(x/)+^j hi{xi) subject to the linear consistency constraints 
Xf = Xi. The Lagrangian L() takes the form /(x/) + 
/i,j(xj)+^^ Xl {xf—Xi) and the dual function q{Xi) takes 
the form in the theorem above. Following the Lagrange duality 
in Theorem]?] there exists primal-dual optimal pair which must 
be primal feasible, i.e., x^ = x*, and satisfy the optimality 
condition - Y.7=i ^* e df{x)) and A* e dh^{x*). Whenever 
/(x) is essentially strictly convex and /* (A) is finite, repeating 
the primal-dual reversing argument of Theorem ]6] shows that 

x* = v.r(-E.An.D 

Algorithmically, minimizing the primal program /(x) + 
X]"=i hi{x) requires to take into account the domains of / and 
hi simultaneously. Therefore, it is algorithmically appealing to 
solve the primal program in a piece-meal fashion using dual 
block ascent, while iteratively improving a single vector A^. 
This way one need to consider only sub-problems that consists 
of /* and a single h*. After we recover the optimal A* one 
can recover efficiently the primal optimal x* by using the 
smoothness of /* as describes in Theorem]?] 

Algorithm 6 (Dual Block Coordinate Ascent): Initialize 
Ai =0,...,A„ = 0. 

1) Repeat until convergence: 

2) For i = 1, ...n: 

b) A, ^ argmax {-/* (-A, - /xj - h* (A,)} 
Output X* = Vr(-E.A*). 

The dual block ascent algorithm iteratively improves the dual 
objective therefore is guaranteed to converge. Whenever /(x) 
is strictly convex in its domain its conjugate is essentially 
smooth and the dual block ascent is guaranteed to converge 
to the global optimum, as formally described below: 
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Theorem 8: (Dual Block Ascent) Let /, hi be closed con- 
vex functions and assume the relative interior of their domains 
intersect. In addition, assume hi are continuous over their 
domains and / is strictly convex over its domain and /* is 
finite. Then, the dual block ascent algorithm converges to the 
dual and primal optimum. 

In particular, if the dual sequence is bounded then every of 
its limit points is an optimal dual solution A J,..., A*. Also, 
consider the primal sequence generated by V/*(— ^jAj) 
computed from the dual sequence, then this primal sequence 
is bounded and its limit point is the optimal solution x*. 

Proof: 1361. □ 

Appendix B 
The Primal-Dual Block Ascent Algorithm 

We describe an algorithm for solving programs of the form 

/(b)+^/l,;(b) 

i 

while solving sub-problems which consists of /(b) and a 
single function hiih). In our framework we include convex as 
well as non-convex optimization, but for now we describe the 
convex settings, and later describe the necessary conditions for 
this optimization scheme for non-convex programs. The dual 
block ascent method, described in Algorithm |6] decomposes 
the optimization program to sub-problems which solve a 
dual function which requires the explicit computation of the 
conjugate functions /*(A) and h*{\) — a task which is often 
algorithmically unattractive or unfeasible. Instead, one can 
recover A^ in Algorithm |6] by solving its primal program and 
using the primal-dual optimality condition in Theorem |6] as 
follows. Set h{h) -s— h.i{h) and g{h) /(b) + b^/ij and re- 
call Claim|5]from which we obtain g*{—\i) = f* {—Xi — fj,i), 
and solve the primal program: 



b* = argmin {/(b) 

h^dorn{f)ndom{hi ) 



b^/X,;+/l,(b)} (22) 



If the pair of functions /(b) and hi{h) satisfy the assump- 
tions of Theorem |6] then the functions g{h) ^ /(b) + b^/Xj 
and hi{h) satisfy these assumptions as well and, hence. A; 
can be recovered from the optimality conditions of Theorem 
|6l 



A* -9/(b*)}n9/j,(b*) 



(23) 



Taken together, one obtains the primal form of the dual 
block ascent algorithm, in which one need not compute the 
conjugate functions: 

Algorithm 7 (Primal-Dual Vanilla): Let the functions /(b) 
and hi{h) satisfy the conditions of Theorem|8] Initialize Ai = 
0,...,A„ = 0. 

1) Repeat until convergence: 

2) For i — 1, ...n: 

b) b* 4- argmin {/(b) + /i,(b) + b^/xj. 

h^dom{f)r]dom(hi ) 

c) Recover A, e {-/x, -df{h*)}n dh,{h*) 



Output b*. 

A useful property of the primal-dual algorithm (and its 
special cases described in the sequel) is that the sparseness 
structure of A,; conforms to the local structure of the functions 
hi in the following sense: assume the variables b are indexed 
by 1,...,TO, and the function hi{h) depends on small subset 
of variables indexed by N{i) C {1, ...,m}, then A*^, contains 
information only for a € N{i) and the remaining entries 
vanish. 

Claim 6 (Locality of Dual Variables): Assume variables b 
are indexed by {!,..., m} and hi{h) depends only on a subset 
of variables indexed by N{i) C {1, • ., m}, then the following 
hold: 



A e dhi{h) 



■i(3 ^ N{i) A^ = 



Proof: 

b = 



Consider the 



(b 



Ar(i)j 



decomposition 
where 



of 



b to two parts 
^(j-)), where N{i) is the complement 
{1, to} \ A^(i), and likewise for the sub-gradient A. Fol- 
lowing Definition |4] if A e dh(h) then 

h, (b) > h, (b) + A^ (b - b) Vb. (24) 

b) decomposes to the sum of 

' jv(i)(bw(i) ~ b^(j)) 



The linear 



term 
b 



A^(b 



K{'i){^N(,) - ^N(i)) and Ajv(i)(b^f,) - b^^^^). Since b is 



arbitrary we can choose b = (b 



N(i), "N(i) 



where b 



N{i 



IS set 



to b 



r(A 



N{i) - 
N{i) 

K{h) > h,ih 



JV(j) 



and b/vrn is arbitrary. Eqn. 



for some arbitrary scalar r > 0, 
then becomes: 



24 



for all r > 0. If we assume to the contrary that A^jj^ ^ then 
we can increase the value of r and thus make the righ-hand 
side of the equation arbitrarily high, while not effecting the 
left hand side since hi{h) is independent of r by the claim 
assumption (as hi depends only on the variables indexed by 
N{i)) - in contradiction to A e dh{h). [] 

The primal-dual algorithm is still unattractive as it requires 
the evaluation of the sub-differentials of df and dhi which 
could be as difficult as the computation of the conjugate 
functions. Our setting, however, is more constrained than the 
setting described in Theorem |8] In particular, the function / = 
f + &B where / is essentially smooth and B = {h : Ah = c} 
is an affine set. Since / is non-differentiable the dual is not 
strictly convex, and thus we cannot expect Aj to be uniquely 
defined. Nevertheless, we show below that A^ has a convenient 
and simple form. 



Claim 7: Let /(b) 
smooth and B — {h : 



= /(b) + 5e(b) where / is essentially 
Ah = c} and assume that dom{hi) C 



dom{f). Assume the functions g(h) ^ /(b) + b /Xj and 
h{h) hi{h) satisfy the assumptions of Theorem|6] Then for 
every real vector cr the sub-gradient A* — —fj,^ — V/s(b*) 



A^cr is optimal dual, i.e. satisfies Eqn. 23 



Proof: Theorem |6] ensures the existence of a primal-dual pair 



(b*,A*) which satisfy Eqn. 23 The domains of /(b) and 



hi{h) are contained in B by assumption, therefore by Claim 

El 



Vtr 



A'^a e {- 



fi,-dfih*)}ndh,{h*), 
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meaning that for every cr the sub-gradient (A* + A^cr) is 
dual optimal. From linearity of the sub-differential we have 
df(h*) = V/(b*) + OSb- Following Claim [3] the sub- 
differential Sb is represented by vectors in the linear subspace 
spanned by the columns of , denoted by A^a-Q. Using 
again the linearity of the sub-differential we deduce 

Vrr (A* + A'^a) - -/x, - V/(b*) + A'^ {a - cto) 

is a dual optimal sub-gradient. The claim follows by replacing 
A*^(A*+AT^). Q 

Algorithm 8 (Primal-Dual Ascent): Let the functions /(b) 
and hi{h) satisfy the conditions of Theorem [s] where in 
addition let /(b) = /(b) + 5 13(h) where /(b) is essentially 
smooth, B ^ {h : Ah — c} and dom(hi) C do'm(f). 
Initialize Ai = 0, A„ =0. 

1) Repeat until convergence: 

2) For i — I, ...n: 

b* -s— argmin 

h^dom(f)ndom(hi 
A,; ^ -/X, - V/(b*) - 



a) 
b) 



{/(b) + /j,(b) + bT^J 



c) 

Output b*. 

Claim 8 (Convergence): Algorithm |8] converges to the dual 
and primal optimum. Moreover, its primal sequence converges 
to the primal optimal point b* and whenever its dual sequence 
is bounded every of its limit point is an optimal dual solution 

1: 

Proof: Algorithm [8] implicitly performs dual block ascent and 
the dual sequence it generates is identical to the dual sequence 
generated by Algorithm [6] therefore inherits the features de- 



scribed in Theorem |8] Theorem 
sequence describe in Theorem^ 



6] relates b* with the primal 

by b*-V/*(-A, -/.,)□ 



The special case of Algorithm [S] when hi = 6c^, where 
Ci is a convex set, and / is essentially smooth, i.e., A — 
0, can be mapped (by eliminating step 2(a)) to a successive 
Bregman projection algorithm |6|, |7| which is also known 
under the names of Dykstra, Hildreth, Han and Csiszar This 
class of iterative projection schemes has a long history starting 
from Von-Neumann in the 50s |43 | who introduced the case 
where /(b) = ||b — bo||^ and C.j are affine sets. In that case 
the primal solution is to find the projection of bo onto the 
intersection of the affine sets CiD-.-DCn and the sub-problem 



in Eqn. 22 corresponds to the projection of /Xj onto the affine 
set Ci. Hildreth 1231 extended the problem with open half 
spaces Ci — {x | a^x < bi}. Bregman |5 1 extended Hildreth's 
problem setup by including any strictly convex function /. The 
special case of Entropy projections was introduced later by 
Csiszar |10|, as /-projections. Dykstra lfT2l . ifTTI was the first 
to introduce general convex sets Ci (i.e., going beyond affine 
sets or half-spaces) but limited the treatment to / representing 
the Euclidean norm and the KL divergence. The view of the 
algorithm with general essentially smooth / and convex sets 
Ci as performing successive Bregman projections is due to 

im, 0. 

Algorithm |8] extends the body of iterative schemes men- 
tioned above along three directions: (i) / is extended to 



non-smooth functions which in turn makes A^ non-uniquely 
defined, (ii) as a result A^ is defined up to an additive term 
which in the context of the message-passing norm-product 
algorithm (and its special cases) translates to the normalization 
of the messages rii^a, and (iii) our algorithm has two auxiliary 
variables, /Xj and Ai, which allows a straightforward mapping 
onto a message-mapping framework and complies with the 
local structure of the underlying graph (Claim [TJ. 

A. The non-convex case 

So far both / and hi were convex, yet Algorithm |8] is 
still well defined when the functions hi are non-convex. The 
purpose of this section is to clarify what can be guaranteed 
under such conditions. We will show that indeed there is no 
convergence guarantees, but if the algorithm does converge 
then it will do so to a stationary point of the primal program. 

To minimize the program /(b) + '}2,ihi(h) one must in- 
troduce Lagrange multipliers Ai,...,A„. Whenever /(b) and 
hi(h) are convex the Lagrange multipliers are the arguments 
of the dual function, and recovering A* amounts to improving 
the dual objective with its best Aj -arguments, therefore this 
procedure is guaranteed to converge. When hi are non-convex 
the Lagrange multipliers do not correspond to a dual function, 
and thus recovering A* amounts to finding a stationary point 
with respect to a sub-problem involving /(b) and a single 
hi(h), and convergence cannot be guaranteed in general. 
Nevertheless, in each iteration we recover Lagrange multipliers 
for a stationary point of a related sub-problem, therefore, 
intuitively, if this method converges, it reaches a stationary 
point of the non-convex program /(b) + /ii(b). 

We consider programs with non-convex smooth functions 
hi(h) restricted to the affine domain {b : Aih = c^}, and 
Legendre-type function /(b), whose conjugate function is 
finite. Recall Theorem |2] describing Legendre-type function 
as an essentially smooth function which is strictly convex 
in its interior and satisfies V/* = (V/)~^. For this type 
of non-convex programs we show in the following claim, 
that if Algorithm |8] converges, it reaches a local-minimum of 
/(b) + E./i,(b): 

Claim 9: Consider Algorithm [8] with A ^ for Legendre- 
type function /(b) and non-convex continuously differentiable 
functions hi{h) restricted to the affine domain {b : Aih = 
Ci}, and assume b* in Eqn. 22 is in the interior of dom{f) 
relative to the affine set dom(hi). Then if the algorithm 
converges it reaches a stationary point of the non-convex 
program /(b) + J2t h^{h). 
Proof: The optimization 

b*'(')= argmin {/(b) + b^/x, + /i,(b)} 

satisfies the conditions of the Lagrange multiplier Theorem]?] 
with respect to the affine set dom(hi), therefore if algorithm 
converges there holds 

(*) Vi V/(b*^W) + /X, + V/i,(b*'W) + Aju* = 

From steps 2a and 2c, for every i there must hold 
X]j=i -^j = — V/(b*'*^*^). The conjugate of Legendre-type 
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min < -y^bi{x,)liicj3,{xi) - 6q(x„) ln?/>i,a(x„) - eciiJ(bi) - ec,,Q(i7(ba) - i?(bi)) > 

bi,bo,,aeN(i)\ ^ — ' ^ — ' ^ — ' ^ — ' 

\ Xi aeN(i) x„ aeN(i) ) 

subject to : 



Fig. 7. The local sub-problem of in step (b) of Algorithm 111 (min /j(b) + fi^ + /ij ^(b)) solved by the norm-product algorithm. 



function satisfies V/* = (V/) ^ by Theorem [2j therefore for 
every i holds b*'(*) = '^f*i-J2j^^)■ This impHes that the 
local primal arguments b*'^*' are the same for every i, and we 
denote them by b*. Summing up the relations in (*) we get 

n n 

^ V/i.(b*) + nV/(b*) + ^ /X, + ^ A,u* = 0. 

i i—1 i—1 

Substituting — — V/(b*) (from step 2c) we obtain 

the stationary condition for b*, i.e. V/(b*) + V/ii(b*) + 



B. The non-strictly convex case 

The case e = in eqn. [7] corresponds to having a non- 
strictly convex function in eqn. 



10 This situation can be 



analyzed in greater generality by observing the behavior of 
Algorithm |7] when the function / is convex but not strictly 
convex. 

For convex /(b) and hiih) the primal program in Eqn. 



19 upper bounds the dual function in Eqn. 20 and the dual 
block ascent optimization scheme which iteratively improves 
the dual function must converge. If the function /(b) is not 
strictly convex its conjugate is not smooth and the dual block 
ascent is not guaranteed to reach the global optimum. We 
describe, in a nutshell, where things go wrong in the Algorithm 
|7] Assume the algorithm converges, then for every i we obtain 
the primal solution 



argmm 

h^dom(f)ndom{hi ] 



{/(b) 



Recovering the dual variables corresponds to finding A* G 
{-H* ~ a/(b*-('))}. Recall that fi* = T,J^^>'^ *en the 
primal-dual relation boils down to — A* = 9/(b*'*^'^) for 
every i. If / was strictly convex it would have imply that all 
the b*'*^*^ are in fact the same b* and it would have ensure 
optimality. Since /(b) is not strictly convex it means that the 
algorithm might converge in the dual domain but we cannot 
recover a consistent b*. 



Appendix C 
The Norm-Product Algorithm 



to the factor-graph structure by setting A^ = {K.ai^a)} (and 
likewise /Xj We first define few short-cut notations: 

def 

i^i,a{Xa) = l/'a(Xa)exp(-^i,a(Xa)), (27) 
def 



def 



Step (b) of Algorithm [T| is reduced to finding b* for all a G 



N{i), described in eqn. 25 



in Fig. |7] 

We will derive the optimal b* and show it has a closed-form 
solution. In the process we will be relying on the following 
observation which we present as a Lemma, without a proof: 

Lemma 1: Let ijj be a non-negative array and p* be the op- 
timal probability array for the following optimization problem: 



P 

then. 



argmm 

P(x)>O.E>cP(x) = l 



n(x) InV'(x) - iJ(p) 



p*(x) 



V'(x) 



p*(x) ln7/'(x) - ff(p* 



(28) 
(29) 



D 



We will be repeatedly using Lemma [T] in the derivation 
of b*, as follows. Let ha\i{y^a\xi) and H{ha\i) be defined 
below: 

def baiXa) 



t*Q|2 \Xi^ 



bi{xi 



def 



In 



bg (Xa) 

bi{xi) 



Note that the constraint ha\i{:x-a\xi) G V, i.e., that b^i^ 
lives in the probability simplex, is equivalent to the marginal 
consistency constraint 6q(Xq) = b(xj) as well. We 

can use HCb^^i) to simplify the conditional entropy term 
H{ha) — H{hi) by the following Lemma: 



Lemma 2: 



iI(b„)-F(b,) = ^6,(x,)i?(b„|,) 



We embed the function definitions of /^ and h^ i into the 
primal-dual Algorithm [T] Given the sparse structure of h^^i 
then, following Claim llj we present the entries of A; according 



Proof: The Lemma is based on the definition of con- 
ditional entropy H{X \ Y) = H{X,Y) - H(Y) = 
'^yP{y)H{X \ Y = y) for random variables X^Y. In our 
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mm <- bi{xi) ln(l)i{xi) - eCiH{hi) bi{xt) ecia 

Xi Xi aeN{i) 



X.a\Xi 



(*) 



(**) 

Fig. 8. Reducing the local sub-problem in Fig. |7]to a series of normalizations by introducing conditional entropies. 



(26) 



terms we have H{xa \ Xi \ Xi) = H{ha) — H{hi) = 

E,. 6,(x,)i/(b„|,). D 

With the definitions above, the optimization problem of step 
(b) as described in eqn. |25] can be broken down to a cascade 
of two steps, described in eqn. 26 in Fig. [8] 

From Lemma [T] (eqn. [29] l we obtain the solution for the 
inner optimization block (*): 

We make the following definition: 



clef 



J2 kM'^^'''^ 

^Xq, \Xi 



(30) 



Therefore, the inner-block denoted by (**) takes the form: 

= -In ma-^i{xi). 

aeN{i) 



Substituting back into eqn. 26 we obtain: 



~ aeN{i) 

and from Lemma [T|(eqn. [28]l we obtain a closed-form solution 

for b*{xi): 

l/e&i 

b*{xi) (X ^ (t)i{xi) Y\_ ma^i{xi) 

aeN{i) 

Finally, fe* (xq) — h'^,^{xa\xi)b* {xi) takes the form: 



(31) 



l/ec,a / 



-AA^ay^^'''"^ (32) 



Next we evaluate step (c) of Algorithm [T] i.e., 

A*„(Xq) = -Hi^Xa) - V/e(6* (Xa)) + CTal, (33) 



where CTq is an arbitrary scalar and is defined in eqn. 
Define n.i-^ a i^a) as follows: 

def 



10 



(34) 



We note, therefore, that the additive constant freedom in the 
definition of A; „ becomes a scaling choice in the definition 



of rii^a- Without loss of generality we choose the scale such 
that ni^ai^a) G V. The claim below sets the value of rii^a- 



Proposition 2: 



Cia/ Cio 



(35) 



Proof: From definition of rii^a and from eqn. |33] we have: 

ni^ai^a) « cxp(^j,c«(xa))exp(V/e(6* (Xq))). 
Substituting the value of V/£(&* (xq)): 

V/,(6* (x„)) = -lnV'a(x„) + ec„(ln6;(x„) + 1), 



and the value of 6* (xq) from eqn. 32 we obtain: 

nt^ai^a) « exp(^i 



m': [Xi 



and following substitution of Cia = Cq + Cia we obtain what 
we set out to prove. [] 

Substituting /x, ^ — J2jeN{a)\i ^j,a into the definition of 
ipi^a (eqn. 27 1 we obtain: 



V'i,a(Xa) = V'a(Xa) exp(-Aj,a (x^)) 

j€N{a)\i 

= Va(Xa) Yi T^]->a{y^a) (36) 

jeN(a)\i 

Substituting eqn. p6| into eqn.[30|we obtain the update rule for 



/ 



1/ecio 



E n 



yXo,\xi y jeN(a)\i 

Substituting eqn. |36] into eqn. |35] we obtain: 



(37) 



V'q(xq) ]^ nj_>Q(xQ) 

I jeN(a)\i 
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and substituting b* in eqn. 3 1 we obtain the update rule for 



rii- 



Appendix D 
Convex-Free-Energy Parameter Settings 

The fractional entropy approximation of eqn. |3] 
^c„iJ(b„) +^c,i/(bi), 

a i 

is strictly convex if it can be written as eqn. |4| 
^c„i/(b„) + ^c,i/(b,)+ cUH[h^)-H{h,)), 

in terms of > Q,Ci,Cia > 0. In this section we will 
introduce a number entropy approximations which fall into 
the convex-free-energy class. We will start with the Tree -re- 
weighted (TRW) entropy approximation |61| and then intro- 
duce other approximations. 

There are two ways, introduced in the literature so far, to 
set parameters for the TRW entropy approximation — both 
of which do not belong the required setup of a convex-free- 
energy. In the first version, the TRW-free-energy corresponds 
to the setting of > 0, = 1 - Y.aeN{i) and c,a = 0, 
where the setting of corresponds to the relative number 
of spanning trees (or hyper-trees) of the graph which include 
the edge (hyperedge) a. The problem with this setting is that 
Ci < 0, thus, even though the fractional entropy approximation 
is convex, the functions hi (defined in terms of Ci and Cia) 
are not convex. 

The second version, introduced by liTTI . sets q as the 
relative number of spanning trees that have node i as a root, 
and for an edge a = Cia is the relative number of 

trees that include the directed edge j — > i. It is possible to 
find such edge probabilities for the uniform distribution over 
all spanning trees by employing a variant of the matrix tree 
theorem for directed trees, liTTl . Il58l p. 141. In this formulation 
Ci, Cia > but Ca = 0. The problem with Cq, = is that the 
function / is no longer strictly convex. 

In the claim below we show how to convert a TRW setting 
according to the second version, i.e., where Ci, Cia > 0, Cq, = 
to the convex-free-energy setting c'^ > 0, cj,cj„ > 0: 

Claim 10: Assume an approximated entropy 

J2a^aH{ha) +J2iCiH{hi) is described by Ci,c^a > and 



= 0, i.e. Ca = Ca+J2 



ieN(a 



Cia and a = Ci-J2, 



■eN{i) ' 



Then there exists c'^,c[,c\^ > which agree on the 
approximated entropy, namely Cq = + J2teN{a} ^la and 

^i—^i" ^Ylia(^N{i) '^ia 

Proof: We describe an efficient algorithm for constructing the 
desired convex free energy: Initialize = 0, = 0, = 0. 
For every i ~ 1, n and every a E N{i) consider the entropy 
combination 



1) Cia < Ci/di, then the entropy can be equivalently written 
by the entropy 



CtaH{ha) 

therefore perform 



a) C'a^ c'^+ Cta 

b) c[^c[ + if^-C,a)- 

2) Cia > Ci/di, then the entropy can be equivalently 
presented as: 



(i/(b„)-ff(bO), 



^i/(b.)+ 

therefore perform 

a) ^ + I 

Since Ci, Cia are positive we obtain an equivalent entropy 
approximation with c'^ > and c^,c^^ > 0. A straight 
forward bookkeeping ensures that and q do not 
change. 



Another concave approximation is to seek a setting of pa- 
rameters Ca > 0,Ci,Cia > such that the approximation H is 
as close as possible to the non-convex Bethe approximatiorj^ 
Given the equations in Definition [T] connecting the parameters 
Ca,Ci, Cia to Cq and Ci, the space of admissible solutions must 
satisfy the following equations: 

Ci+ Y {ca+ ^ Cjq) = 1, i^l,...,n 

aeN{i) jeN{a)\i 
Ci, Cia ^ Ca ^ 0. 

Among all possible admissible solutions we choose the one in 
which Ca is as uniform as possible, i.e., we apply Laplace's 
principle of insufficient reasoning. The criterion function, 
therefore, minimizes: 



mill y {ca 

Ci ,Cia -Ca^admissihle ^ — ^ 



1)^ 



(38) 



i£N(a) 



and divide it to two cases: 



which is a least-squares criteria for uniformity of Ca- We 
refer to the two least-squares scheme as L2 convex free 
energy approximation. In an earlier work fT9|, we also used 
the maximum entropy approach where the criterion function 
minimizes Ca Inco,. Further investigation for constructing 
good convex free energy approximations can be found in |39|. 

The desire towards uniformity, besides being used exten- 
sively in probabilistic settings, is motivated by the success of 
the Bethe free energy where Ca — 1- The Bethe free energy is 
non-convex for factor graphs with cycles, thus is not a member 
of the convex free energies, but empirical evidence suggest that 
when BP converges the marginals are surprisingly good. For 
Bethe free energy Ca — I over all factor nodes a — hence our 
proposal to strive for uniformity over the space of admissible 
solutions. In some sense we are attempting to "convexify" the 
Bethe free energy, although this is not being done directly. 

-A similar idea was independently derived by Nir Friedman and his 
collaborators — personal communication. 
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Appendix E 
Incorporating zero potentials 

A particularly important class of factors are those with zero 
potentials. These type of potentials are used, for example, 
in defining error-correcting codes. Note that if one or more 
of the factor potentials ipa{xa) or local potentials <j){xi) are 
equal to zero, then the overall probability of states which 
contain these configurations is zero, namely p{xa) — 
or p{xi) = respectively. This restriction on the marginal 
probabilities implicitly appears in the variational programs 
using the convention OlnO = and — xlnO = oo for x > 0. 

Recall that the variational approach seeks a distribution 
p{xi, ■■■,Xn) which is as close as possible, in relative entropy 
terms, to the product Jli '^i(^i) Hq V'q(xq). Expanding the 
relative entropy produces the free energy: 



9 



(P) = ^a{Xa)p{Xa) 



\e^{Xi)p{Xi) - H{p), 



where 9a — ~\ntpa, Qi = — In^i, and p(xq), p{xi) are 
the marginal probabilities, and H{p) is the entropy function. 
Since — xlnO — oo whenever x > 0, the zero potential 
■00 (xq) = constraints p € dom{g) if and only if ^(xq) 0, 
Likewise, 4>i{xi) = constrains p e dom{g) whenever 
p{xi) — 0. Following the above, the inference program which 
corresponds to the free energy minimization is well defined 
for zero potentials when we consider the domain of the free 
energy, and takes the form minpgc(o„i(g) g{p)- In spite the 
mathematical difficulty introduced by using zero potentials, 
it makes no difference from algorithmic perspective, since the 
optimal distribution p* is the normalization of the product 
of potentials, p* cx JJ - tpa and it respects the domain 

constraint. 

The same behavior appears in the variational approach for 
MAP assignment, where one seeks vector x* which maximize 
the energy Y\i4'i{xi)Y\a''Pa{^a)- This task is described by 
the linear function 

5(P) = X! ^a{Xa)p{Xa) + ^ 0i{x^)p{x^), 

whereas the zero potentials of the form <j)i{xi) = or 
i'ai^a) — constraints p e dom{g) if and only if p{xi) ~ 
or p(xq) ~ respectively. Again, this mathematical nuance 
makes no difference from algorithmic perspective, since the 
optimal distribution p* is the a zero-one distribution, i.e. 
p*{x*) = 1 and for every x 7^ x* holds p*(x) — 0. 

The framework of incorporating zero potentials in the 
domain of the optimized program also corresponds to the 
approximate inference and LP-relazation described by the 
minimization of the function 

5(b) ^Ye,{xi)bi{x,) + ^ 9a{xa)ba{xa) - eH(h), 

whose domain is constrained by zero potentials, namely 
4>i{xi) = or V'a(xa) = Constraints b G dom{g) if and 
only if bi{xi) = or ^^(xq) = respectively. The domain 
constrains are inherited by the norm-product algorithm where 
we represent g{h) in the form feO^) + J2i ^c,i(t>) described in 
eqns. 10 11 In particular, b e dom{f^) only if 6q,(xq,) = 



whenever ipa{x.a) = 0, and b € dom{hf^i) only if hi{xi) = 
whenever (f>i{xi) = 0. This domain constraint do not affect the 
norm-product algorithm whose optimal beliefs are a (power) 
normalization of the potentials multiplied by the messages, 
described in eqn. 



31 and eqn. 32 



Therefore, in the norm- 
product optimization framework, zero potential i/;q(xq) = 
or (f)i{xi) = induces optimal beliefs satisfying 6* (x^) = 
or h*{xi) = respectively. 
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